Inverse optimal control from incomplete trajectory observations

Abstract

This article develops a methodology that enables learning an objective function of an optimal control system from incomplete trajectory observations. The objective function is assumed to be a weighted sum of features (or basis functions) with unknown weights, and the observed data is a segment of a trajectory of system states and inputs. The proposed technique introduces the concept of the recovery matrix to establish the relationship between any available segment of the trajectory and the weights of given candidate features. The rank of the recovery matrix indicates whether a subset of relevant features can be found among the candidate features and the corresponding weights can be learned from the segment data. The recovery matrix can be obtained iteratively and its rank non-decreasing property shows that additional observations may contribute to the objective learning. Based on the recovery matrix, a method for using incomplete trajectory observations to learn the weights of selected features is established, and an incremental inverse optimal control algorithm is developed by automatically finding the minimal required observation. The effectiveness of the proposed method is demonstrated on a linear quadratic regulator system and a simulated robot manipulator.

Keywords

Inverse optimal control, inverse reinforcement learning, incomplete trajectory observations, objective learning

1. Introduction

Inverse optimal control (IOC), also known as inverse reinforcement learning, solves a problem of finding an objective (e.g., cost or reward) function to explain behavioral observations of an optimal control system (Kalman, 1964; Ng and Russell, 2000). Successful applications of IOC techniques are broad, including imitation learning (Abbeel et al., 2010; Kolter et al., 2008), where a learner mimics an expert by inferring an objective function from the expert’s demonstrations, autonomous driving (Kuderer et al., 2015), where human driving preference is learned and transferred to a vehicle controller, human–robot shared systems (Mainprice et al., 2015, 2016), where the intentionality of a human partner is estimated to enable motion prediction and smooth coordination, and human motion analysis (Jin et al., 2019; Lin et al., 2016), where principles of human motor control are investigated.

The most common strategy used in IOC is to parametrize an unknown objective function as a weighted sum of relevant features (or basis functions) with unknown weights (Abbeel and Ng, 2004; Englert et al., 2017; Mombaur et al., 2010; Ng and Russell, 2000; Ratliff et al., 2006; Ziebart et al., 2008). Different approaches have been developed to estimate the weights of the features given an observation of the system’s optimal trajectory over a complete motion horizon, i.e., a full trajectory observation. These methods cannot deal with the case when only incomplete trajectory observations are available, that is, when only a portion/segment of the system’s trajectory within a small time interval of the horizon is available. A method capable of learning objective functions from incomplete trajectory data would be beneficial for multiple reasons: first, an observation of the system’s trajectory over a complete time horizon may not be accessible due to limited sensing capabilities, sensor failures, occlusions, etc.; second, the computational cost of existing IOC techniques based on full trajectory data may be large; and, third, learning objective functions from incomplete trajectory observations may help to address some other challenging problems such as learning time-varying objective functions (Jin et al., 2019), online motion prediction (Pérez-D’Arpino and Shah, 2015), and learning from human corrections (Bajcsy et al., 2018). Motivated by these considerations, this article aims to develop a technique that learns an objective function from incomplete trajectory observations only.

1.1. Related work

Existing IOC methods can be categorized based on whether the forward optimal control problem needs to be computed within the learning process. The first category of existing works is based on a nested architecture, where the feature weights are updated in an outer loop while the corresponding optimal control problem is solved in an inner loop. Different methods of this type focus on different strategies to update the feature weights in the outer loop. Representative methods include (Abbeel and Ng, 2004), where the feature weights are updated towards matching the feature values of the reproduced optimal trajectories with the demonstrations, (Ratliff et al., 2006), where the feature weights are solved by maximizing the margin between the objective function value of the observed trajectories and the value of any simulated optimal trajectories, and (Ziebart et al., 2008), where the feature weights are optimized such that the probability distribution of the system’s trajectories maximizes the entropy while matching the empirical feature values of demonstrations. These nested IOC methods have been successfully applied to humanoid locomotion (Park and Levine, 2013), autonomous vehicles (Kuderer et al., 2015), robot navigation (Vasquez et al., 2014), learning from human corrections (Bajcsy et al., 2017; Jin et al., 2020), etc. In (Mombaur et al., 2010), the weights are learned by minimizing the deviation of the reproduced optimal trajectory from the observed one; similar methods have been applied to studying human walking (Clever et al., 2016) and arm motion (Berret et al., 2011).

A drawback of the nested IOC methods is the need to solve optimal control problems repeatedly in the inner loop, so these methods usually incur a high computational cost. This motivates the second line of IOC methods, which seek to directly solve for the unknown feature weights. A key idea used in these methods is to establish optimality conditions which the observed optimal data must satisfy. For example, in (Keshavarz et al., 2011), the Karush–Kuhn–Tucker (KKT) (Boyd and Vandenberghe, 2004) optimality conditions are established, based on which the feature weights are then solved by minimizing a loss that quantifies the violation of such conditions by the observed data. (Puydupin-Jamin et al., 2012) apply such a KKT-based method to solve IOC problems and study the objective function underlying human locomotion. In (Johnson et al., 2013), Pontryagin’s minimum principle (Pontryagin et al., 1962) is utilized to formulate a residual optimization over the unknown weights. These methods have been applied successfully to locomotion analysis (Aghasadeghi and Bretl, 2014; Puydupin-Jamin et al., 2012), walking path generation (Papadopoulos et al., 2016), human motion segmentation (Jin et al., 2019; Lin et al., 2016), etc. (Englert et al., 2017) proposed an inverse KKT method to enable a robot to learn manipulation tasks. Recently, along this direction, the recoverability for IOC problems has been investigated. For example, when an optimal control system remains at an equilibrium point, although its trajectory still satisfies the optimality conditions, it is uninformative for learning the objective function. This issue is discussed in (Molloy et al., 2018, 2016), where a sufficient condition for recovering weights from full trajectory observations is proposed.

The existing IOC techniques cited above require a full observation of a system trajectory, that is, optimal trajectory data of the system states and inputs over a complete motion horizon. To the best of the authors’ knowledge, IOC problems based on incomplete trajectory observations have rarely been investigated. By an incomplete trajectory observation, we mean that the observed data is a segment or portion of the system trajectory within a small time interval of the horizon. We consider objective learning from incomplete trajectory data for the following reasons. First, in certain practical cases, for example, owing to limited sensing, sensor failures, or occlusions, a full observation of the system’s trajectory data may not be available. Second, although direct IOC methods improve the computational efficiency compared with their nested counterparts, the computational cost is still significant, especially when handling complex systems with high-dimensional action/state spaces and long time horizons. Third, successfully learning objective functions using only incomplete trajectory data could help address many challenging problems such as identifying time-varying objective functions (Jin et al., 2019), learning from human corrections (Bajcsy et al., 2018), and online long-term motion prediction (Mainprice et al., 2016).

1.2. Contributions

This article develops a methodology to learn an objective function of an optimal control system using incomplete trajectory observations. The proposed key concept to achieve this goal is the recovery matrix, which is defined on any segment data of the system trajectory and a given candidate feature set. We show that learning of the objective function is related to the rank and kernel properties of the recovery matrix. Different from existing methods, the recovery matrix also captures the unseen future information in addition to the available data by an unknown costate variable, which is jointly estimated along with the unknown feature weights. By investigating the properties of the recovery matrix, the following insights to solving IOC problems are enabled:

(1) the rank of the recovery matrix indicates whether an observation of incomplete trajectory data is sufficient for learning the feature weights;

(2) additional observation data can contribute to learning the unknown objective function, or at least not degrade the learning; and irrelevant features can be identified;

(3) the IOC can be solved by incrementally incorporating the observation of each data point along the trajectory.

Based on the recovery matrix, an IOC approach based on incomplete trajectory observations is established, and an incremental IOC algorithm is developed by automatically finding the minimal required observation.

The structure of this article is as follows. Section 2 states the problem. Section 3 develops the recovery matrix and its properties. Section 4 presents the IOC method and algorithm using incomplete trajectory observations. Section 5 conducts numerical experiments, and Section 6 draws conclusions.

Notation

The column operator $col {x_{1}, \dots, x_{k}}$ stacks its arguments into a column. Here $x_{k_{1} : k_{2}}$ denotes a column stack of $x$ indexed from $k_{1}$ to $k_{2}$ ( $k_{1} \leq k_{2}$ ), that is, $x_{k_{1} : k_{2}} = col {x_{k_{1}}, \dots, x_{k_{2}}}$ . We use $A$ (bold) to denote a block matrix. Given a vector function $f (x)$ and a constant $x^{*}$ , $\frac{\partial f}{\partial x^{*}}$ denotes the Jacobian matrix with respect to $x$ evaluated at $x^{*}$ . The zero matrix and vector are written as 0, and the identity matrix as $I$ , both with appropriate dimensions. $A^{'}$ is the transpose of matrix $A$ . Here $σ_{i} (A)$ denotes the i th smallest singular value of matrix $A$ , e.g., $σ_{1} (A)$ is the smallest singular value. We use $\ker A$ to denote the kernel of matrix $A$ .

2. Problem formulation

Consider an optimal control system with the following discrete-time dynamics and initial condition:

x_{k + 1} = f (x_{k}, u_{k}), x_{0} \in R^{n}

(1)

where the vector function $f : R^{n} \times R^{m} \mapsto R^{n}$ is differentiable; $x_{k} \in R^{n}$ is the system state; $u_{k} \in R^{m}$ is the control input; and $k = 0, 1, \dots$ is the time step. Suppose a trajectory of states and inputs over a horizon T,

ξ = {ξ_{k} : k = 0, 1, \dots, T} with ξ_{k} = (x_{k}^{*}, u_{k}^{*})

(2)

(locally) minimizes a cost function

J (x_{0 : T}, u_{0 : T}) = \sum_{k = 0}^{T} ω^{'} ϕ^{*} (x_{k}, u_{k})

(3)

where $ω^{'} ϕ^{*} (\cdot, \cdot)$ is the running cost. Here $ϕ^{*} : R^{n} \times R^{m} \mapsto R^{s}$ is called a relevant feature vector and is defined as the column stack of a relevant feature set

F^{*} = {ϕ_{1}^{*}, ϕ_{2}^{*}, \dots, ϕ_{s}^{*}}

(4)

that is, $ϕ^{*} = col F^{*}$ , with $ϕ_{i}^{*}$ being the i th feature for the running cost, and $ω \in R^{s}$ is called the weight vector, with the i th entry $ω_{i}$ corresponding to $ϕ_{i}^{*}$ . This type of weighted-feature objective function is commonly used in objective learning problems (Abbeel and Ng, 2004; Molloy et al., 2018; Ziebart et al., 2008), and has been successfully applied in a wide range of real-world applications (Bajcsy et al., 2018; Englert et al., 2017; Kuderer et al., 2015; Lin et al., 2016). The dynamics (1) and cost function (3) can represent different optimal control settings as follows: (I) finite-horizon free-end optimal control, where the finite horizon T is given but the final state $x_{T + 1}$ is free, i.e., no constraint on $x_{T + 1}$ ; (II) finite-horizon fixed-end optimal control, where both the finite horizon T and the final state $x_{T + 1} = x_{goal}$ are given; and (III) infinite-horizon optimal control, where $T = \infty$ . In addition, one can consider the finite-horizon optimal control, where the final state $x_{T + 1}$ is penalized using a final cost term added to (3), and this case can be viewed as an extension similar to (II). In our following expositions, we only focus on the first three settings.

In IOC problems, one is given a relevant feature set $F^{*}$ , and the goal is to obtain an estimate of the weights $ω$ corresponding to these features from observations of $ξ$ . Note that $ω$ can only be determined up to a non-zero scaling factor (Keshavarz et al., 2011; Molloy et al., 2018), because any $c ω$ with $c > 0$ will lead to the same trajectory $ξ$ . Hence, we say an estimate $\hat{ω}$ is a successful estimate of $ω$ if $\hat{ω} = c ω$ with $c \neq 0$ , and the specific $c > 0$ can be determined by normalization (Englert et al., 2017; Keshavarz et al., 2011).

We have noted that existing IOC methods typically assume that the full trajectory data $ξ$ is available. Violation of this assumption will lead to a failure of existing approaches, as we demonstrate in the numerical evaluation in Section 5. In this article, we aim to address this challenge by developing a technique to estimate $ω$ only using incomplete trajectory data. Specifically, given a relevant feature set $F^{*}$ , the goal of this article is to achieve a successful estimate $\hat{ω} = c ω$ using an incomplete trajectory observation

ξ_{t : t + l} = {ξ_{k} : t \leq k \leq t + l} \subseteq ξ

(5)

which is a segment of $ξ$ within the time interval $[t, t + l] \subseteq [0, T]$ . Here, t is called the observation starting time and $l = 1, 2, \dots$ is called the observation length, with $0 \leq t < t + l \leq T$ . Moreover, for any observation starting time t, we aim to efficiently find the minimal required observation length, that is, $l_{\min}$ , to achieve a successful estimate of $ω$ . Note that in the above problem setting, we only know that the data $ξ_{t : t + l}$ is a segment of a system trajectory $ξ$ ; we do not know the value of t (i.e., the observation starting time relative to the start of the trajectory), and we do not require any other information about $ξ$ such as the time horizon T or which type of optimal control problem $ξ$ is a solution to.

3. The recovery matrix

In this section, we introduce the key concept of the recovery matrix and show its relation to the IOC process. Some properties of the recovery matrix are investigated to provide insights into the IOC process. Connections between the recovery matrix and existing methods are also discussed. Finally, the implementation of the recovery matrix is presented.

3.1. Definition of the recovery matrix

We first present the definition of the recovery matrix, then show its relation to the IOC problem solution, which is also the motivation of the recovery matrix.

Definition 1. Let a segment of the trajectory, $ξ_{t : t + l} \subseteq ξ$ in (5), and a candidate feature set $F = {ϕ_{1}, ϕ_{2}, \dots, ϕ_{r}}$ be given. Let $ϕ = col F$ . Then the recovery matrix, denoted by $H (t, l)$ , is defined as

H (t, l) = [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \end{matrix}] \in R^{ml \times (r + n)}

(6)

with

H_{1} (t, l) = F_{u} (t, l) F_{x}^{- 1} (t, l) Φ_{x} (t, l) + Φ_{u} (t, l)

(7)

H_{2} (t, l) = F_{u} (t, l) F_{x}^{- 1} (t, l) V (t, l)

(8)

Here, $F_{x} (t, l)$ , $F_{u} (t, l)$ , $Φ_{x} (t, l)$ , $Φ_{u} (t, l)$ , and $V (t, l)$ are defined as

F_{x} (t, l) = [\begin{matrix} I & - \frac{\partial f^{'}}{\partial x_{t + 1}^{*}} & & \\ 0 & I & \ddots & \\ & & \ddots & - \frac{\partial f^{'}}{\partial x_{t + l - 1}^{*}} \\ & & & I \end{matrix}] \in R^{nl \times nl}

(9)

F_{u} (t, l) = [\begin{matrix} \frac{\partial f^{'}}{\partial u_{t}^{*}} & & & \\ & \frac{\partial f^{'}}{\partial u_{t + 1}^{*}} & & \\ & & \ddots & \\ & & & \frac{\partial f^{'}}{\partial u_{t + l - 1}^{*}} \end{matrix}] \in R^{ml \times nl}

(10)

Φ_{x} (t, l) = {[\begin{matrix} \frac{\partial ϕ}{\partial x_{t + 1}^{*}} & \frac{\partial ϕ}{\partial x_{t + 2}^{*}} & \dots & \frac{\partial ϕ}{\partial x_{t + l}^{*}} \end{matrix}]}^{'} \in R^{nl \times r}

(11)

Φ_{u} (t, l) = {[\begin{matrix} \frac{\partial ϕ}{\partial u_{t}^{*}} & \frac{\partial ϕ}{\partial u_{t + 1}^{*}} & \dots & \frac{\partial ϕ}{\partial u_{t + l - 1}^{*}} \end{matrix}]}^{'} \in R^{ml \times r}

(12)

V (t, l) = {[\begin{matrix} 0 & \frac{\partial f}{\partial x_{t + l}^{*}} \end{matrix}]}^{'} \in R^{nl \times n}

(13)

respectively.
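Definition 1 can be turned into a short numerical routine. The following Python/NumPy sketch assembles $H (t, l)$ from per-step Jacobians according to (6)–(13). The function name `recovery_matrix` and the argument layout (the lists `fx`, `fu`, `phix`, `phiu`) are illustrative assumptions of this sketch, not notation from the article.

```python
import numpy as np

def recovery_matrix(fx, fu, phix, phiu):
    """Assemble H(t, l) of Definition 1 from Jacobians along a segment.

    Hypothetical argument layout (an assumption of this sketch), i = 0..l-1:
      fx[i]   : df/dx   evaluated at (x*_{t+1+i}, u*_{t+1+i}), shape (n, n)
      fu[i]   : df/du   evaluated at (x*_{t+i},   u*_{t+i}),   shape (n, m)
      phix[i] : dphi/dx evaluated at (x*_{t+1+i}, u*_{t+1+i}), shape (r, n)
      phiu[i] : dphi/du evaluated at (x*_{t+i},   u*_{t+i}),   shape (r, m)
    """
    l = len(fu)
    n, m = fu[0].shape

    # F_x in (9): identity diagonal blocks, -(df/dx)' on the block superdiagonal.
    Fx = np.eye(n * l)
    for i in range(l - 1):
        Fx[i * n:(i + 1) * n, (i + 1) * n:(i + 2) * n] = -fx[i].T

    # F_u in (10): block diagonal with blocks (df/du)'.
    Fu = np.zeros((m * l, n * l))
    for i in range(l):
        Fu[i * m:(i + 1) * m, i * n:(i + 1) * n] = fu[i].T

    # Phi_x in (11) and Phi_u in (12): stacked transposed feature Jacobians.
    Phix = np.vstack([p.T for p in phix])
    Phiu = np.vstack([p.T for p in phiu])

    # V in (13): zero except for the last block row, (df/dx_{t+l})'.
    V = np.zeros((n * l, n))
    V[-n:, :] = fx[-1].T

    # H_1 in (7) and H_2 in (8); solve() avoids forming the inverse of F_x.
    H1 = Fu @ np.linalg.solve(Fx, Phix) + Phiu
    H2 = Fu @ np.linalg.solve(Fx, V)
    return np.hstack([H1, H2])
```

For a linear system $x_{k + 1} = A x_{k} + B u_{k}$ , every entry of `fx` would be $A$ and every entry of `fu` would be $B$ ; the returned array has $ml$ rows and $r + n$ columns, as in (6).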

Before showing a relationship between the recovery matrix and IOC, we impose the following assumption on the given candidate feature set $F$ in Definition 1.

Assumption 1. In Definition 1, the candidate feature set $F$ contains as a subset the relevant features $F^{*}$ in (4), i.e., $F^{*} \subseteq F$ .

Assumption 1 requires that the relevant features $F^{*}$ in (4) are contained in the given candidate feature set $F$ , which means that $F$ may also include additional features that are irrelevant to the optimal control system. Although this restricts the choice of features, the assumption is likely to be fulfilled in implementation by providing a larger set including many features when the exact relevant features are not known. Under Assumption 1, without loss of generality, we let

F = {ϕ_{1}^{*}, ϕ_{2}^{*}, \dots, ϕ_{s}^{*}, {\tilde{ϕ}}_{s + 1}, \dots, {\tilde{ϕ}}_{r}}

(14)

that is, the first s elements are from $F^{*}$ in (4). Then we have

ϕ (x, u) = col F = [\begin{matrix} ϕ^{*} (x, u) \\ \tilde{ϕ} (x, u) \end{matrix}] \in R^{r}

(15)

where $ϕ^{*} \in R^{s}$ is the relevant feature vector in (3), whereas $\tilde{ϕ} \in R^{(r - s)}$ corresponds to the features that are not in $F^{*}$ . We define a weight vector

\bar{ω} = col {ω, 0} \in R^{r}

(16)

corresponding to (15), where $ω$ are the weights in (3) for $ϕ^{*}$ . Based on (3), we can say that the system’s optimal trajectory $ξ$ in (2) also (locally) minimizes the cost function

J (x_{0 : T}, u_{0 : T}) = \sum_{k = 0}^{T} {\bar{ω}}^{'} ϕ (x_{k}, u_{k})

(17)

with the dynamics and initial condition in (1). Next, we distinguish the three optimal control settings, as described in the problem formulation, and then establish the relationship between the recovery matrix and the IOC problem solution.

Case I: Finite-horizon free-end optimal control. We first consider the optimal control setting with finite horizon T and free final state $x_{T + 1}$ . In this case, given the cost function (17) and the dynamics constraint (1), one can define the following Lagrangian:

L = J (x_{0 : T}, u_{0 : T}) + \sum_{k = 0}^{T} λ_{k + 1}^{'} (f (x_{k}, u_{k}) - x_{k + 1})

(18)

where $λ_{k + 1} \in R^{n}$ , $k = 0, 1, \dots, T$ , are Lagrange multipliers. According to the KKT optimality conditions (Boyd and Vandenberghe, 2004), there exist multipliers $λ_{1 : T + 1}^{*} = col {λ_{1}^{*}, λ_{2}^{*}, \dots, λ_{T}^{*}, λ_{T + 1}^{*}}$ , also referred to as costates, such that the optimal trajectory $ξ$ must satisfy the following conditions

\frac{\partial L}{\partial x_{1 : T + 1}^{*}} = 0

(19a)

\frac{\partial L}{\partial u_{0 : T}^{*}} = 0

(19b)

Based on the definitions in (9)–(13), Equations (19a) and (19b) can be written as

- F_{x} (0, T) λ_{1 : T}^{*} + Φ_{x} (0, T) \bar{ω} = 0 = - V (0, T) λ_{T + 1}^{*}

(20a)

F_{u} (0, T) λ_{1 : T}^{*} + Φ_{u} (0, T) \bar{ω} = 0

(20b)

respectively, where in (20a), $λ_{T + 1}^{*} = 0$ directly results from extending (19a) to the final state $x_{T + 1}$ . The optimality equations in (20) are established for the full optimal trajectory $ξ$ . Given any segment of the trajectory, say $ξ_{t : t + l} \subseteq ξ$ in (5), the following equations can be obtained by partitioning (20a) and (20b) in rows,

- F_{x} (t, l) λ_{t + 1 : t + l}^{*} + Φ_{x} (t, l) \bar{ω} = - V (t, l) λ_{t + l + 1}^{*}

(21a)

F_{u} (t, l) λ_{t + 1 : t + l}^{*} + Φ_{u} (t, l) \bar{ω} = 0

(21b)

respectively. For (21), we note that when $ξ_{t : t + l} = ξ_{0 : T}$ , i.e., when the observation is the full trajectory data $ξ$ , (21) will become (20). Thus, a full trajectory observation can be viewed as a special case of an incomplete trajectory observation, and we further discuss this in Section 3.3.
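As an intermediate step, the stacked conditions (20) and (21) can be read off from the per-time-step stationarity conditions of the Lagrangian (18). The expansion below is a sketch obtained by differentiating (18), with $J$ given by (17), with respect to each state and input; it uses only symbols already defined in this section.

```latex
\begin{aligned}
\frac{\partial L}{\partial x_{k}^{*}} &= \frac{\partial ϕ^{'}}{\partial x_{k}^{*}} \bar{ω} + \frac{\partial f^{'}}{\partial x_{k}^{*}} λ_{k + 1}^{*} - λ_{k}^{*} = 0, & k &= 1, \dots, T \\
\frac{\partial L}{\partial x_{T + 1}^{*}} &= - λ_{T + 1}^{*} = 0 \\
\frac{\partial L}{\partial u_{k}^{*}} &= \frac{\partial ϕ^{'}}{\partial u_{k}^{*}} \bar{ω} + \frac{\partial f^{'}}{\partial u_{k}^{*}} λ_{k + 1}^{*} = 0, & k &= 0, \dots, T
\end{aligned}
```

Stacking the first line for $k = t + 1, \dots, t + l$ and moving the term in $λ_{t + l + 1}^{*}$ of the last row to the right-hand side gives (21a); stacking the third line for $k = t, \dots, t + l - 1$ gives (21b).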

Case II: Finite-horizon fixed-end optimal control. We next consider the optimal control setting with a finite horizon T and a given fixed final state $x_{T + 1} = x_{goal}$ . Given the cost function (17), the dynamics (1), and the final state constraint $x_{T + 1} = x_{goal}$ , one can define the following Lagrangian:

L = J + \sum_{k = 0}^{T} λ_{k + 1}^{'} (f (x_{k}, u_{k}) - x_{k + 1}) + λ_{goal}^{'} (x_{T + 1} - x_{goal})

(22)

where the difference from (18) is that the term $λ_{goal}^{'} (x_{T + 1} - x_{goal})$ is added because the final state is subject to the given $x_{goal}$ constraint, and $λ_{goal} \in R^{n}$ is the associated Lagrange multiplier. Following a similar derivation as in Case I, one obtains the same equations in (21) for any segment data of the trajectory $ξ_{t : t + l} \subseteq ξ$ . Here, the only difference from Case I is that when $ξ_{t : t + l} = ξ_{0 : T}$ , one usually has $λ_{T + 1}^{*} = λ_{goal}^{*} \neq 0$ in this case due to the fixed final state constraint, whereas $λ_{T + 1}^{*} = 0$ in Case I. In addition, for finite-horizon optimal control in which the final state $x_{T + 1}$ is penalized using a final cost term added to (3), we can derive a similar result, $λ_{T + 1}^{*} \neq 0$ .

Case III: Infinite-horizon optimal control. For the infinite-horizon optimal control setting, the optimal trajectory $ξ$ is more conveniently characterized by the Bellman optimality condition (Bertsekas, 1995):

V (x_{k}^{*}) = {\bar{ω}}^{'} ϕ (x_{k}^{*}, u_{k}^{*}) + V (f (x_{k}^{*}, u_{k}^{*}))

(23)

where $V (x_{k}^{*})$ is the (unknown) optimal cost-to-go function evaluated at state $x_{k}^{*}$ . Next, we differentiate the Bellman optimality equation in (23) on both sides with respect to $x_{k}^{*}$ while denoting $λ_{k}^{*} = \frac{\partial V (x_{k})}{\partial x_{k}^{*}} \in R^{n}$ , and then obtain

λ_{k}^{*} = \frac{\partial ϕ^{'}}{\partial x_{k}^{*}} \bar{ω} + \frac{\partial f^{'}}{\partial x_{k}^{*}} λ_{k + 1}^{*}

(24)

Differentiating the Bellman optimality equation (23) on both sides with respect to $u_{k}^{*}$ yields

0 = \frac{\partial ϕ^{'}}{\partial u_{k}^{*}} \bar{ω} + \frac{\partial f^{'}}{\partial u_{k}^{*}} λ_{k + 1}^{*}

(25)

For any available trajectory segment $ξ_{t : t + l} \subseteq ξ$ , we stack Equation (24) for all $x_{t + 1 : t + l}^{*}$ and stack Equation (25) for all $u_{t : t + l - 1}^{*}$ , and obtain the same equations in (21).

From the analysis, we conclude that, for any trajectory segment $ξ_{t : t + l} \subseteq ξ$ , regardless of the corresponding optimal control problem, we can always use the segment data $ξ_{t : t + l}$ to establish Equations (21). Thus, in what follows, we do not distinguish the specific optimal control settings, and only focus on Equations (21) to show the relationship between the recovery matrix in Definition 1 and IOC problem solution.

Note that $F_{x} (t, l)$ in (21a) is always invertible, because by (9) it is block upper triangular with identity blocks on its diagonal. We can therefore combine (21a) with (21b) and eliminate $λ_{t + 1 : t + l}^{*}$ , which yields

\begin{matrix} (F_{u} (t, l) F_{x}^{- 1} (t, l) Φ_{x} (t, l) + Φ_{u} (t, l)) \bar{ω} \\ + (F_{u} (t, l) F_{x}^{- 1} (t, l) V (t, l)) λ_{t + l + 1}^{*} = 0 \end{matrix}

(26)

Considering the definition of the recovery matrix in (6)–(8), Equation (26) can be written as

\begin{matrix} H_{1} (t, l) \bar{ω} + H_{2} (t, l) λ_{t + l + 1}^{*} \\ = H (t, l) [\begin{matrix} \bar{ω} \\ λ_{t + l + 1}^{*} \end{matrix}] = 0 \end{matrix}

(27)

Equation (27) reveals that the weights $\bar{ω}$ and costate $λ_{t + l + 1}^{*}$ must satisfy a linear equation, where the coefficient matrix is exactly the recovery matrix that is defined on the trajectory segment $ξ_{t : t + l} \subseteq ξ$ , and candidate feature set $F$ . Here, the costate $λ_{t + l + 1}^{*}$ can be interpreted as a variable encoding the unseen future information beyond the observational interval $[t, t + l]$ . In fact, from the discussions for Case III, we note that costate $λ_{t + l + 1}^{*}$ is the gradient of the optimal cost-to-go function with respect to the state evaluated at $x_{t + l + 1}^{*}$ .
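The elimination step behind (26) can be spelled out as follows. This is a sketch that uses only (21a), (21b), and the invertibility of $F_{x} (t, l)$ ; the arguments $(t, l)$ are dropped in the second line for brevity.

```latex
\begin{aligned}
λ_{t + 1 : t + l}^{*} &= F_{x}^{- 1} (t, l) \left( Φ_{x} (t, l) \bar{ω} + V (t, l) λ_{t + l + 1}^{*} \right) && \text{(solving (21a))} \\
0 &= F_{u} F_{x}^{- 1} \left( Φ_{x} \bar{ω} + V λ_{t + l + 1}^{*} \right) + Φ_{u} \bar{ω} && \text{(substituting into (21b))}
\end{aligned}
```

Collecting the terms in $\bar{ω}$ and in $λ_{t + l + 1}^{*}$ gives the two blocks $H_{1} (t, l)$ and $H_{2} (t, l)$ of (7) and (8). Because $H (t, l)$ has $ml$ rows and $r + n$ columns, a rank of $r + n - 1$ is only possible when $ml \geq r + n - 1$ .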

In IOC problems, in order to obtain an estimate of the unknown weights $\bar{ω}$ only using the available segment data $ξ_{t : t + l}$ , one also needs to account for the unknown $λ_{t + l + 1}^{*}$ , as in (27). The following theorem establishes a relationship between a trajectory segment $ξ_{t : t + l} \subseteq ξ$ and a successful estimate of the weights $\bar{ω}$ for given candidate features $F$ .

Theorem 1. Given a trajectory segment $ξ_{t : t + l} \subseteq ξ$ , let the recovery matrix $H (t, l)$ be defined as in Definition 1 with the candidate feature set $F$ satisfying Assumption 1. Let a vector $col {\hat{ω}, \hat{λ}} \neq 0$ satisfy $col {\hat{ω}, \hat{λ}} \in \ker H (t, l)$ with $\hat{ω} \in R^{r}$ . If

rank H (t, l) = r + n - 1

(28)

then there exists a constant $c \neq 0$ such that the i th entry of $\hat{ω}$ satisfies

{\hat{ω}}_{i} = \begin{cases} c ω_{i}, & \text{if } ϕ_{i} \in F^{*} \\ 0, & \text{otherwise} \end{cases}

(29)

and vector $col {{\hat{ω}}_{i} : ϕ_{i} \in F^{*}, i = 1, 2, \dots, r} = c ω$ thus is a successful estimate of $ω$ in (3).

Proof. Based on Equations (21), we note that for a trajectory segment $ξ_{t : t + l} \subseteq ξ$ , there always exists $λ_{t + l + 1}^{*} \in R^{n}$ such that $col {\bar{ω}, λ_{t + l + 1}^{*}}$ satisfies (27), i.e., $col {\bar{ω}, λ_{t + l + 1}^{*}} \in \ker H (t, l)$ . Condition (28) means that the kernel of $H (t, l)$ is one-dimensional, so any non-zero vector $col {\hat{ω}, \hat{λ}} \in \ker H (t, l)$ has $\hat{ω} = c \bar{ω}$ with $c \neq 0$ . Thus, one can conclude that $\hat{ω}$ is a scaled version of $\bar{ω}$ , and that the entries in $\hat{ω}$ corresponding to the relevant features in $F^{*}$ stack into a successful estimate of $ω$ in (3). This completes the proof. □

Remark. Theorem 1 states that the recovery matrix bridges trajectory segment data to the unknown objective function. First, the rank of the recovery matrix $H (t, l)$ indicates whether one is able to use the trajectory segment $ξ_{t : t + l} \subseteq ξ$ to obtain a successful estimate of weights $\bar{ω}$ for the given candidate features $F$ . In particular, if the rank condition (28) is satisfied, then for any non-zero vector $col {\hat{ω}, \hat{λ}}$ in the kernel of $H (t, l)$ , the vector of its first r entries, $\hat{ω}$ , satisfies $\hat{ω} = c \bar{ω}$ . Second, including additional irrelevant features in $F$ will not influence the weight estimate for the relevant features, because the weight estimates in $\hat{ω}$ for these irrelevant features will be zero. We demonstrate this in numerical experiments in Section 5. We also demonstrate the use of the recovery matrix for different optimal control problems later in the numerical experiments.
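The kernel computation in Theorem 1 can be sketched numerically with a singular value decomposition. The function name `weights_from_kernel`, the relative tolerance `rel_tol`, and the normalization (unit Euclidean norm of $\hat{ω}$ with the largest-magnitude entry made positive) are assumptions of this sketch; the article only states that the scale can be fixed by normalization.

```python
import numpy as np

def weights_from_kernel(H, r, rel_tol=1e-9):
    """Return (w_hat, lam_hat) spanning ker H if rank H = r + n - 1, else None.

    H is a recovery matrix with r + n columns; r is the number of candidate
    features. rel_tol is a hypothetical threshold for the numerical rank.
    """
    n = H.shape[1] - r
    # Full SVD: rows of Vt beyond the numerical rank span the kernel of H.
    _, s, Vt = np.linalg.svd(H)
    rank = int(np.sum(s > rel_tol * s[0])) if s.size else 0
    if rank != r + n - 1:
        return None  # rank condition (28) is not satisfied
    z = Vt[-1]  # unit vector spanning the one-dimensional kernel
    scale = np.linalg.norm(z[:r])
    if scale == 0.0:
        return None  # degenerate kernel vector with no weight component
    z = z / scale
    # Resolve the sign ambiguity (c may be negative): assumed convention.
    if z[np.argmax(np.abs(z[:r]))] < 0:
        z = -z
    return z[:r], z[r:]
```

Entries of the returned `w_hat` that are numerically zero would then correspond to irrelevant candidate features, in line with (29).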

3.2. Properties of the recovery matrix

As the recovery matrix connects trajectory segment data to the unknown cost function, we next investigate the properties of the recovery matrix, which will provide a better understanding of how the data and the selected features are incorporated in the IOC process. We first present an iterative formula for the recovery matrix.

Lemma 1 (Iterative property). For a trajectory segment $ξ_{t : t + l} \subset ξ$ and the subsequent data point $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ , one has

\begin{matrix} H (t, l + 1) = [\begin{matrix} H_{1} (t, l + 1) & H_{2} (t, l + 1) \end{matrix}] \\ = [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}] [\begin{matrix} I & 0 \\ \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} & \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}] \end{matrix}

(30)

with $H (t, 1)$ corresponding to $ξ_{t : t + 1} = (x_{t : t + 1}^{*}, u_{t : t + 1}^{*})$ :

\begin{matrix} H (t, 1) = [\begin{matrix} H_{1} (t, 1) & H_{2} (t, 1) \end{matrix}] \\ = [\begin{matrix} (\frac{\partial f^{'}}{\partial u_{t}^{*}} \frac{\partial ϕ^{'}}{\partial x_{t + 1}^{*}} + \frac{\partial ϕ^{'}}{\partial u_{t}^{*}}) & \frac{\partial f^{'}}{\partial u_{t}^{*}} \frac{\partial f^{'}}{\partial x_{t + 1}^{*}} \end{matrix}] \end{matrix}

(31)

Proof. Please see Appendix A. □

The iterative property shows that the recovery matrix can be calculated by incrementally integrating each subsequent data point $ξ_{t + l + 1}$ into the current recovery matrix $H (t, l)$ . Owing to this property, the computation of matrix inversions in the recovery matrix in Definition 1 can be avoided. This property will be used to devise efficient IOC algorithms in Section 4.2.
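The update in Lemma 1 amounts to one row-block append followed by one matrix product. The following Python/NumPy sketch implements (31) and (30) directly; the function names `init_recovery` and `extend_recovery` and their argument order are hypothetical choices for illustration.

```python
import numpy as np

def init_recovery(fu_t, fx_next, phiu_t, phix_next):
    """H(t, 1) from (31).

    fu_t, phiu_t       : df/du (n x m) and dphi/du (r x m) at (x*_t, u*_t)
    fx_next, phix_next : df/dx (n x n) and dphi/dx (r x n) at (x*_{t+1}, u*_{t+1})
    """
    H1 = fu_t.T @ phix_next.T + phiu_t.T
    H2 = fu_t.T @ fx_next.T
    return np.hstack([H1, H2])

def extend_recovery(H, r, fu_k, fx_next, phiu_k, phix_next):
    """H(t, l + 1) from H(t, l) via (30), without any matrix inversion.

    fu_k, phiu_k       : Jacobians w.r.t. u at (x*_{t+l}, u*_{t+l})
    fx_next, phix_next : Jacobians w.r.t. x at (x*_{t+l+1}, u*_{t+l+1})
    """
    n = H.shape[1] - r
    # Left factor: append the new row block [dphi'/du, df'/du] below H(t, l).
    left = np.vstack([H, np.hstack([phiu_k.T, fu_k.T])])
    # Right factor: [[I, 0], [dphi'/dx, df'/dx]] at the new state.
    right = np.block([[np.eye(r), np.zeros((r, n))],
                      [phix_next.T, fx_next.T]])
    return left @ right
```

Starting from `init_recovery` and calling `extend_recovery` once per new data point yields a matrix with $m$ additional rows per step, so the cost of each update depends on the current number of rows rather than on inverting $F_{x} (t, l)$ .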

The recovery matrix is defined on two elements: one is the segment data $ξ_{t : t + l}$ and the other is the selected candidate feature set $F$ . In what follows, we show how these two components affect the recovery matrix and, in turn, the IOC process. For data observations, we expect that including more data points in $ξ_{t : t + l}$ may contribute to enabling the successful estimation of the unknown weights. This is implied by the following lemma.

Lemma 2 (Rank non-decreasing property). For a trajectory segment $ξ_{t : t + l} \subset ξ$ and any $F$ , one has

rank H (t, l) \leq rank H (t, l + 1)

(32)

if the new trajectory point $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ has $det (\frac{\partial f}{\partial x_{t + l + 1}^{*}}) \neq 0$ .

Proof. Please see Appendix B. □

We have noted in Theorem 1 that the rank of the recovery matrix is related to whether one is able to use segment data $ξ_{t : t + l}$ to achieve a successful estimate of the weights. Thus, the rank of the recovery matrix can be viewed as an indicator of the capability of the available segment data $ξ_{t : t + l}$ to reflect the unknown weights. Lemma 2 shows that additional data, if the Jacobian matrix of the dynamics at the new point is non-singular, tends to contribute to solving the IOC problem by increasing the rank of the recovery matrix towards satisfying (28), or at least will not decrease it. Further in Section 5.1.2, we analytically and experimentally demonstrate in which cases the additional observation data increase the rank of the recovery matrix, and in which cases they leave the rank unchanged.

The next lemma provides a necessary condition for the rank of the recovery matrix if the candidate feature set $F$ contains as a subset the relevant features $F^{*}$ , i.e., $F^{*} \subseteq F$ .

Lemma 3 (Rank upper bound property). If Assumption 1 holds, then for any trajectory segment $ξ_{t : t + l} \subseteq ξ$ ,

rank H (t, l) \leq r + n - 1

(33)

always holds. If there exists another relevant feature subset $\tilde{F} \subseteq F$ with corresponding weights $\tilde{ω}$ , where $\tilde{F} \neq F^{*}$ or $\tilde{ω} \neq c ω$ , then inequality (33) holds strictly:

rank H (t, l) < r + n - 1

(34)

Proof. Please see Appendix C. □

Lemma 3 states that if a candidate feature set contains as a subset the relevant features under which the system trajectory $ξ$ is optimal, the kernel of the recovery matrix (for any data segment) is at least one-dimensional. Moreover, when there exist more than one combination of relevant features among the given candidate features, which means there exists another subset of relevant features or another independent weight vector, then the rank condition (28) in Theorem 1 is impossible to fulfil for the trajectory segment $ξ_{t : t + l}$ regardless of the observation length l and starting time t (we also experimentally illustrate this in Section 5.1.2). This also implies that though Assumption 1 is likely to be satisfied by using a larger feature set that covers all possible features, it may also lead to the non-uniqueness of relevant features. On the other hand, if Assumption 1 fails to hold, that is, the candidate feature set $F$ does not contain a complete set of relevant features, then, owing to the rank non-decreasing property in Lemma 2, the recovery matrix is more likely to have $rank H (t, l) = r + n$ after increasing the observation length. To sum up, Lemma 3 can be leveraged to investigate whether the selection of candidate features is proper or not.

Combining Lemmas 2 and 3, we are able to show: under Assumption 1, (i) if the rank of the recovery matrix is less than $r + n - 1$ , then increasing the observation length l to include additional trajectory points may increase the rank; and (ii) once the segment reaches $rank H = r + n - 1$ , additional observation data will not increase the rank of the recovery matrix and the successful estimate of feature weights can be found in the kernel of the recovery matrix. We experimentally demonstrate this in Sections 5.1.1 and 5.2.1. In Section 5.1.2, we further analyze how additional observation data will change the rank of the recovery matrix.

3.3. Relationship with prior work

We next discuss the relationship between the recovery matrix and existing IOC techniques (Aghasadeghi and Bretl, 2014; Englert et al., 2017; Johnson et al., 2013; Keshavarz et al., 2011; Molloy et al., 2018, 2016; Puydupin-Jamin et al., 2012). In those methods, an observation of the system’s full trajectory $ξ$ is considered, for which a set of optimality equations, such as the KKT conditions (Boyd and Vandenberghe, 2004) or Pontryagin’s minimum principle (Pontryagin et al., 1962), is then established. As developed by Englert et al. (2017) and Molloy et al. (2018), based on optimality conditions, a general form for using trajectory data to establish a linear constraint on the unknown feature weights $ω$ can be summarized as

M (ξ) ω = 0

(35)

where $M (ξ)$ is the coefficient matrix that depends on the trajectory data $ξ$ . A requirement implicit in those methods is that the observed data $ξ$ itself has to be optimal with respect to the cost function; thus, full trajectory data $ξ_{0 : T}$ is generally required (an incomplete segment $ξ_{t : t + l} \subseteq ξ$ in general does not itself optimize the cost function).

Conversely, through the recovery matrix developed in this article, any trajectory segment $ξ_{t : t + l} \subseteq ξ$ poses a linear constraint on $ω$ by

H (t, l) [\begin{matrix} ω \\ λ_{t + l + 1} \end{matrix}] = H_{1} (t, l) ω + H_{2} (t, l) λ_{t + l + 1} = 0

(36)

Comparing (35) with (36), we have the following comments.

(1) If we consider the segment $ξ_{t : t + l} = ξ_{0 : T}$ , i.e., given the full trajectory $ξ$ , then, owing to $λ_{T + 1} = 0$ (assuming the end-free optimal control setting), Equation (36) becomes

H_{1} (0, T) ω = 0

(37)

Comparing (37) with (35) we immediately obtain

H_{1} (0, T) = M (ξ)

(38)

Thus, the coefficient matrix $M (ξ)$ that is commonly used in existing IOC methods can be considered as a special case of the recovery matrix when the available data is the full trajectory, i.e., $ξ_{t : t + l} = ξ_{0 : T}$ .

(2) However, as in (38), the coefficient matrix $M (ξ)$ only corresponds to the first term of the recovery matrix, i.e., $H_{1} (t, l)$ . A key difference of the recovery matrix is its ability to handle any incomplete data $ξ_{t : t + l} \subseteq ξ$ . As the incomplete data $ξ_{t : t + l}$ itself may not be optimal with respect to the objective function when $t + l < T$ , the unseen future information thus must be taken care of if one wants to successfully learn the unknown weights $ω$ . As in (36), the recovery matrix accounts for such unseen future information via its second term $H_{2} (t, l)$ and the unknown costate $λ_{t + l + 1}$ . As we demonstrate later in experiments (Section 5.1.3), such a step to account for the unseen future information can enable successful learning of the cost function using a very small segment of the full trajectory. Also importantly, as we have analyzed in Section 3.1, such an advantage enables the recovery matrix to solve the IOC problems for infinite-horizon optimal control systems. We also demonstrate this in Section 5.1.4 with a numerical example.

(3) In addition to the capability of dealing with incomplete observation data, the recovery matrix can also provide insights and computational efficiency for solving IOC problems, as presented in Section 3.2. Such properties and advantages, however, cannot be achieved using the coefficient matrix $M (ξ)$ in existing IOC methods (Aghasadeghi and Bretl, 2014; Englert et al., 2017; Johnson et al., 2013; Keshavarz et al., 2011; Molloy et al., 2018, 2016; Puydupin-Jamin et al., 2012).

3.4. Implementation of rank evaluation

In practice, directly checking the rank of the recovery matrix is challenging due to: (i) data noise; (ii) near-optimality of demonstrations, i.e., the observed trajectory slightly deviates from the optimal one; and (iii) computational error. Thus, one can use the following strategies to evaluate the rank of the recovery matrix.

Normalization. When the observed data is of low magnitude, the entries of the recovery matrix may be close to zero, which may affect the rank evaluation owing to rounding error. Hence, we normalize the recovery matrix before verifying its rank,

\bar{H} (t, l) = \frac{H (t, l)}{{‖ H (t, l) ‖}_{F}}

(39)

where $∥ \cdot ∥_{F}$ is the Frobenius norm and we only consider the recovery matrix that is not a zero matrix. Then

rank \bar{H} (t, l) = rank H (t, l)

(40)

Rank index. As we are only interested in whether the rank of the recovery matrix satisfies $rank H (t, l) = r + n - 1$ , instead of directly investigating the rank, we choose to look at the singular values of $\bar{H} (t, l)$ by introducing the following rank index

κ (t, l) = \begin{cases} 0, & \text{if } σ_{2} (\bar{H} (t, l)) = 0 \\ σ_{2} (\bar{H}) / σ_{1} (\bar{H}), & \text{otherwise} \end{cases}

(41)

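Here $σ_{1} (\bar{H})$ and $σ_{2} (\bar{H})$ are read as the smallest and second-smallest singular values of $\bar{H} (t, l)$ , so that a one-dimensional kernel corresponds to $σ_{1} = 0$ and $σ_{2} > 0$ . The condition $rank H (t, l) = r + n - 1$ is thus equivalent to $κ (t, l) = + \infty$ . However, owing to data noise, $κ (t, l) = + \infty$ usually cannot be reached and $κ (t, l)$ takes a finite value in practice (we demonstrate this in Section 5.2.4). We thus pre-set a threshold $γ$ and verify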

κ (t, l) \geq γ

(42)

to decide whether $rank H (t, l) = r + n - 1$ is fulfilled or not. Later in Section 5.2.4, we show how observation data and noise levels influence the rank index $κ (t, l)$ , and how to accordingly choose a proper $γ$ .
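
The normalization (39) and the rank index (41) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `rank_index` is hypothetical, $σ_{1}$ and $σ_{2}$ are taken to be the smallest and second-smallest singular values, and the singular values are padded with zeros when the matrix has fewer rows than columns (an assumption made here so that a wide matrix, whose kernel is at least two-dimensional, yields $κ = 0$).

```python
import numpy as np

def rank_index(H):
    """Rank index kappa of (41) for a non-zero matrix H with at least two columns."""
    H = np.asarray(H, dtype=float)
    H_bar = H / np.linalg.norm(H, "fro")             # normalization (39)
    s = np.linalg.svd(H_bar, compute_uv=False)       # min(rows, cols) singular values
    # Assumption: pad with zeros so that there is one singular value per column.
    s = np.sort(np.concatenate([s, np.zeros(H.shape[1] - s.size)]))
    sigma1, sigma2 = s[0], s[1]                      # smallest, second smallest
    if sigma2 == 0.0:
        return 0.0                                   # kernel at least two-dimensional
    if sigma1 == 0.0:
        return np.inf                                # kernel exactly one-dimensional
    return sigma2 / sigma1

gamma = 1e6                                          # hypothetical threshold for (42)
H_example = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
print(rank_index(H_example) >= gamma)
```

With noisy data the exact-zero branches are rarely taken; the ratio is then simply large or small, which is why the comparison against $γ$ in (42) is the practical test.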

4. Proposed IOC approaches

Using the recovery matrix, in this section we develop the IOC techniques using incomplete trajectory observations to learn the cost function formulated in Section 2. Furthermore, we propose an incremental IOC algorithm by automatically finding the minimal required observation length.

4.1. IOC using incomplete trajectory observations

The following corollary states a method to use an observation of the incomplete trajectory to achieve a successful estimate of weights for given relevant features.

Corollary 1 (IOC using incomplete trajectory observations). For the optimal control system in (1), given an incomplete trajectory observation $ξ_{t : t + l} \subseteq ξ$ and a relevant feature set $F = F^{*}$ in (4), the recovery matrix $H (t, l)$ is defined as in Definition 1. If

rank H (t, l) = s + n - 1

(43)

and a non-zero vector $col {\hat{ω}, \hat{λ}} \in \ker H (t, l)$ with $\hat{ω} \in R^{s}$ , then $\hat{ω}$ is a successful estimate of $ω$ , i.e., there must exist a non-zero constant c such that $\hat{ω} = c ω$ .

Proof. As Corollary 1 is a special case of Theorem 1, by following a similar procedure as in proof of Theorem 1, we can show that there must exist a costate $λ_{t + l + 1}^{*}$ such that the weight vector $ω$ in (3) jointly with $λ_{t + l + 1}^{*}$ satisfy

H (t, l) [\begin{matrix} ω \\ λ_{t + l + 1}^{*} \end{matrix}] = 0

(44)

As the rank condition (43) holds, it follows that the nullity of $H (t, l)$ is one. Then for any non-zero vector $col {\hat{ω}, \hat{λ}} \in \ker H (t, l)$ , there must exist a constant $c \neq 0$ such that $\hat{ω} = c ω$ . $\hat{ω}$ thus is a successful estimate of $ω$ , which completes the proof. □

Remark. Suppose that in Corollary 1, Equation (43) is not satisfied. According to Lemma 3, $rank H (t, l) < s + n - 1$ holds and, thus, the dimension of $\ker H (t, l)$ is at least two. This means that another weight vector, independent of $ω$ , could be found in $\ker H (t, l)$ , and the current segment $ξ_{t : t + l}$ may be generated by this different weight vector. In this case, true weights are not distinguishable or recoverable with respect to $ξ_{t : t + l}$ . The reason for this case can be insufficiently long observation or low data informativeness, both of which may be remedied by including additional data (i.e., increase l) according to Lemma 2 (we illustrate this later in experiments).

4.2. Incremental IOC algorithm

Combining Theorem 1 and the properties of the recovery matrix, one has the following conclusions: (i) because observing more data points may increase the rank of the recovery matrix (Lemma 2), which is bounded from above (Lemma 3), the minimal required observation length $l_{\min}$ that reaches the rank upper bound can be found; (ii) from Lemmas 2 and 3, the minimal required observation length can be found even if additional irrelevant features exist; and (iii) from Lemma 1, the minimal required observation length can be found efficiently. In sum, we have the following incremental IOC approach.

Corollary 2 (Incremental IOC approach). Given candidate features $F$ satisfying Assumption 1, the recovery matrix $H (t, l)$ , starting from t, is updated at each time step with a new observed point $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ via Lemma 1. Then the minimal required observation length for a successful estimate of the feature weights is

l_{\min} (t) = min {l | rank H (t, l) = | F | + n - 1}

(45)

For any non-zero vector $col {\hat{ω}, \hat{λ}} \in \ker H (t, l_{\min} (t))$ with $\hat{ω} \in R^{| F |}$ , $\hat{ω}$ is a successful estimate of the weights for $F$ with the weights for irrelevant features being zeros.

Proof. Corollary 2 is a direct application of Theorem 1 and Lemmas 1–3. □

From Corollary 2, we note that starting from time t, the minimal required observation length $l_{\min} (t)$ to solve IOC problems is the one satisfying (45). As we show later in experiments (Sections 5.1.2, 5.2.1, 5.2.2, and 5.2.3), $l_{\min} (t)$ varies depending on the informativeness of data $ξ_{t : t + l}$ and the selected candidate features. Whatever influences $l_{\min} (t)$ , one can always find a necessary lower bound of the minimal required observation length due to the size of the recovery matrix, $H (t, l) \in R^{ml \times (| F | + n)}$ , and matrix rank properties, that is,

l_{\min} (t) \geq ⌈ \frac{| F | + n - 1}{m} ⌉

(46)

where ⌈·⌉ is the ceiling operation. Equation (46) implies that including additional irrelevant features to $F$ will require more data in order to successfully solve IOC problems (as shown later in Section 5.2.3).
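
As a quick arithmetic check of the lower bound (46), the snippet below evaluates it for the two systems used in Section 5. The helper name `min_length_lower_bound` is hypothetical; the state and input dimensions are read off from (50) for the LQR system ($n = 2$, $m = 1$) and from (60) for the two-link arm ($n = 4$, $m = 2$).

```python
import math

def min_length_lower_bound(num_features, n, m):
    """Lower bound (46): ceil((|F| + n - 1) / m)."""
    return math.ceil((num_features + n - 1) / m)

# LQR system of Section 5.1: n = 2 states, m = 1 input
print(min_length_lower_bound(3, n=2, m=1))   # 4
print(min_length_lower_bound(4, n=2, m=1))   # 5

# Two-link arm of Section 5.2: n = 4 states, m = 2 inputs
print([min_length_lower_bound(f, n=4, m=2) for f in (3, 6, 10)])   # [3, 5, 7]
```

For the LQR system with three features the bound equals 4, which is the observation length at which the rank condition is first met in Section 5.1.1. The bound is only necessary: less informative data can require a longer observation.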

In practice, directly applying Corollary 2 is challenging in the presence of data noise, near-optimality of trajectory, computing error, etc. Thus, we adopt the following strategies for implementation. First, (45) can be investigated based on the rank index (41) by checking (42) (the choice of $γ$ is discussed later in Section 5.2.4). Second, the computation of a successful estimate can be implemented by solving the following constrained optimization

\hat{ω} = \arg min_{ω, λ} ‖ \bar{H} (t, l_{\min} (t)) [\begin{matrix} ω \\ λ \end{matrix}] ‖^{2}

(47)

subject to

\sum_{i = 1}^{| F |} ω_{i} = 1

(48)

where ∥·∥ denotes the $l_{2}$ norm and $\bar{H}$ is the normalized recovery matrix (see (39)). To avoid trivial solutions, we add the constraint (48), which normalizes the weight estimate to sum to one, as in Englert et al. (2017).
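
One way to solve the equality-constrained least-squares problem (47)-(48) is to eliminate the constraint with a null-space parametrization and call an ordinary least-squares solver. The sketch below is one possible approach under that choice, not necessarily the one used in the released code; the function name `solve_weights` and the synthetic test matrix are illustrative assumptions.

```python
import numpy as np

def solve_weights(H_bar, num_features):
    """Minimize ||H_bar @ [w; lam]||^2 subject to sum(w) = 1, as in (47)-(48)."""
    n_cols = H_bar.shape[1]
    a = np.zeros(n_cols)
    a[:num_features] = 1.0                      # constraint vector: a @ z = 1
    z0 = a / num_features                       # particular solution of the constraint
    # Rows 1.. of Vt form an orthonormal basis of {z : a @ z = 0}.
    _, _, Vt = np.linalg.svd(a.reshape(1, -1))
    N = Vt[1:].T
    y, *_ = np.linalg.lstsq(H_bar @ N, -H_bar @ z0, rcond=None)
    z = z0 + N @ y
    return z[:num_features], z[num_features:]

# Synthetic check: build a matrix whose kernel is spanned by a known vector.
rng = np.random.default_rng(0)
z_true = np.array([0.1, 0.3, 0.6, 1.0, -2.0])   # [weights; costate], hypothetical values
H = rng.standard_normal((6, 5)) @ (np.eye(5) - np.outer(z_true, z_true) / (z_true @ z_true))
w_hat, lam_hat = solve_weights(H / np.linalg.norm(H), num_features=3)
print(w_hat, lam_hat)
```

When the rank condition holds, the minimum of (47) is zero and the constraint (48) only fixes the scale of the kernel vector; when it does not hold, the solver still returns a minimizer, which is why the rank check should come first.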

In sum, the implementation of the proposed incremental IOC approach in Corollary 2 is presented in Algorithm 1. Algorithm 1 permits arbitrary observation starting time, and the observation length is automatically found by checking the rank condition using (41) and (42). The algorithm can be viewed as an adaptive-observation-length IOC algorithm.

Algorithm 1: Incremental IOC algorithm
Input: a candidate feature set $F$ , a threshold $γ$ ;
Initial: Any observation starting time t;
Initialize $l = 1$ , $H (t, l)$ with $ξ_{t : t + 1} = (x_{t : t + 1}^{*}, u_{t : t + 1}^{*})$ via (31);
while $H (t, l)$ not satisfying (42) do
Obtain subsequent data $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ ;
Update $H (t, l)$ with $ξ_{t + l + 1}$ via (30);
$l \leftarrow l + 1$ ;
end
minimal required observation length: $l_{\min} (t) = l$ ;
compute a successful estimate $\hat{ω}$ via (47)–(48).

5. Numerical experiments

We evaluate the proposed method on two systems. First, on a linear quadratic regulator (LQR) system, we demonstrate the rank properties of the recovery matrix, show its capability of handling incomplete trajectory data by comparing with the related IOC methods, and demonstrate its capability to solve IOC for infinite-horizon LQR. Second, on a simulated two-link robot arm, we evaluate the proposed techniques in terms of observation noise, including irrelevant features, and parameter settings. Throughout evaluations, we quantify the accuracy of a weight estimate $\hat{ω}$ by introducing the following estimation error:

e_{ω} = inf_{c > 0} \frac{∥ c \hat{ω} - ω ∥}{∥ ω ∥}

(49)

where ∥·∥ denotes the $l_{2}$ norm, $\hat{ω}$ is the weight estimate, and $ω$ is the ground truth. Obviously, $e_{ω} = 0$ means that $\hat{ω}$ is a successful estimate of $ω$ . The source code is available at the following link: https://github.com/wanxinjin/IOC-from-Incomplete-Trajectory-Observations.
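
The infimum in (49) has a closed form: the best scale is the projection coefficient $c = \hat{ω}^{'} ω / \hat{ω}^{'} \hat{ω}$ when it is positive, and otherwise the infimum is approached as $c \to 0$ and equals one. The sketch below implements this reading of (49); the function name `estimation_error` is hypothetical.

```python
import numpy as np

def estimation_error(w_hat, w_true):
    """Scale-invariant error (49): inf over c > 0 of ||c * w_hat - w_true|| / ||w_true||."""
    w_hat = np.asarray(w_hat, dtype=float)
    w_true = np.asarray(w_true, dtype=float)
    # Projection coefficient, clipped at zero because c must be positive;
    # c = 0 gives the limiting value 1 when w_hat points away from w_true.
    c = max(w_hat @ w_true / (w_hat @ w_hat), 0.0)
    return np.linalg.norm(c * w_hat - w_true) / np.linalg.norm(w_true)

w_true = [0.1, 0.3, 0.6]
print(estimation_error([0.2, 0.6, 1.2], w_true))     # positive rescaling of the truth
print(estimation_error([-0.1, -0.3, -0.6], w_true))  # opposite direction
```

The first call returns zero up to rounding, since the estimate is a positive multiple of the ground truth; the second returns one.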

5.1. Evaluations on LQR Systems

Consider a finite-horizon free-end LQR system where the dynamics is

x_{k + 1} = [\begin{matrix} - 1 & 1 \\ 0 & 1 \end{matrix}] x_{k} + [\begin{matrix} 1 \\ 3 \end{matrix}] u_{k}

(50)

with initial state $x_{0} = [2, - 2]^{'}$ , and the quadratic cost function is

J = \sum_{k = 0}^{T} (x_{k}^{'} Q x_{k} + u_{k}^{'} R u_{k})

(51)

with the time horizon $T = 50$ . Here, Q and R are positive definite matrices and assumed to have the structure

Q = [\begin{matrix} q_{1} & 0 \\ 0 & q_{2} \end{matrix}], R = r

(52)

respectively. In the feature-weight form (3), the cost function (51) corresponds to the feature vector $ϕ^{*} = [x_{1}^{2}, x_{2}^{2}, u^{2}]^{'}$ and weights $ω = [q_{1}, q_{2}, r]^{'}$ . We here set $ω = [0.1, 0.3, 0.6]^{'}$ to generate the optimal trajectory of the LQR system, which is plotted in Figure 1. In the IOC problem, we are given the features $ϕ^{*}$ ; the goal is to obtain a successful estimate of $ω$ using the optimal trajectory data in Figure 1.

Fig. 1.

The optimal trajectory of an LQR system (50)–(51) using the weights $ω = [0.1, 0.3, 0.6]^{'}$ .

5.1.1. Minimal required observations for IOC

Based on the above LQR system, we here illustrate how the recovery matrix can be used to check whether incomplete trajectory data suffices for the minimal observation required for a successful estimation. Given the features $ϕ^{*} = [x_{1}^{2}, x_{2}^{2}, u^{2}]^{'}$ , we set the observation starting time $t = 0$ , and incrementally increase the observation length l from 1 to horizon $T = 50$ . For each observation length l, we check the rank of the recovery matrix $H (0, l)$ and solve the weights from the kernel of $H (0, l)$ (the weights are normalized to have a sum of one). The results are plotted in Figure 2.

Fig. 2.

The rank of the recovery matrix and weight estimate when the observation starts at $t = 0$ and the observation length l increases from 1 to T. The upper panel shows the rank of the recovery matrix $H (0, l)$ versus l; and the bottom panel shows the corresponding weight estimate for each l. Note that the given features are $ϕ^{*} = [x_{1}^{2}, x_{2}^{2}, u^{2}]^{'}$ and the ground-truth weights are $ω = [0.1, 0.3, 0.6]^{'}$ . For $l < 4$ , because the dimension of the kernel of $H (0, l)$ is at least two and, thus, $H (0, l) [\hat{ω}, \hat{λ}]^{'} = 0$ has multiple solutions of $[\hat{ω}, \hat{λ}]^{'}$ , we choose the solution $\hat{ω}$ from the kernel of $H (0, l)$ randomly.

As shown in the upper panel in Figure 2, including additional trajectory data points, i.e., increasing the observation length l (from 1), leads to an increase of the rank of the recovery matrix. When $l = 4$ , $rank H (0, l)$ reaches 4, which is the rank upper bound $n + r - 1 = 4$ , and $rank H (0, l) = 4$ for all $l \geq 4$ . This illustrates the properties of the recovery matrix in Lemmas 2 and 3. From the bottom panel in Figure 2, we see that when $l < 4$ , for which $rank H (0, l) < 4$ , the weight estimate $\hat{ω}$ is not a successful estimate of $ω$ . When $rank H (0, l) < 4$ , since the dimension of the kernel of $H (0, l)$ is at least two and, thus, $H (0, l) [\hat{ω}, \hat{λ}]^{'} = 0$ has multiple solutions $[\hat{ω}, \hat{λ}]^{'}$ , we choose the solution $\hat{ω}$ from the kernel of $H (0, l)$ randomly. For $l \geq 4$ , when $rank H (0, l) = 4$ , the estimate is a successful estimate, thus indicating the effectiveness of using the rank condition in (45) to check whether an incomplete observation suffices for the minimal required observation.
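
The experiment can be reproduced in spirit with the sketch below. It is not the paper's recursive update (30)-(31): instead, it stacks the input-stationarity conditions $2 r u_{k} + B^{'} λ_{k + 1} = 0$ of Pontryagin's principle for the LQR (50)-(51), with the costates unrolled backwards through $λ_{k} = 2 Q x_{k} + A^{'} λ_{k + 1}$ to a single unknown costate at the end of the segment. The indexing (one row per observed input, $k = t, …, t + l - 1$), the helper names, and the use of `matrix_rank` with a fixed tolerance on noise-free data are all assumptions of this sketch.

```python
import numpy as np

A = np.array([[-1.0, 1.0], [0.0, 1.0]])    # dynamics (50)
B = np.array([[1.0], [3.0]])
W_TRUE = np.array([0.1, 0.3, 0.6])         # [q1, q2, r]
T = 50

def optimal_trajectory(x0):
    """Free-end finite-horizon LQR solved by a backward Riccati recursion."""
    Q, R = np.diag(W_TRUE[:2]), W_TRUE[2]
    P = np.zeros((2, 2))                   # no terminal cost: P_{T+1} = 0
    gains = [None] * (T + 1)
    for k in range(T, -1, -1):
        gains[k] = -(B.T @ P @ A) / (R + (B.T @ P @ B).item())
        P = Q + A.T @ P @ (A + B @ gains[k])
    xs, us = [np.asarray(x0, dtype=float)], []
    for k in range(T + 1):
        us.append((gains[k] @ xs[-1]).item())
        xs.append(A @ xs[-1] + B[:, 0] * us[-1])
    return np.array(xs), np.array(us)

def stacked_matrix(xs, us, t, l):
    """One row per input u_k, k = t..t+l-1; unknowns are [q1, q2, r, lambda_{t+l}]."""
    end = t + l - 1
    rows = []
    for k in range(t, end + 1):
        coef_q = np.zeros(2)
        AsB = B[:, 0].copy()               # A^s B, starting at s = 0
        for j in range(k + 1, end + 1):    # costate recursion unrolled to the segment end
            coef_q += 2.0 * AsB * xs[j]
            AsB = A @ AsB
        rows.append(np.concatenate([coef_q, [2.0 * us[k]], AsB]))
    return np.array(rows)

def minimal_length(xs, us, t, target_rank, tol=1e-9):
    """Grow the segment until the normalized stacked matrix reaches target_rank."""
    for l in range(1, len(us) - t + 1):
        H = stacked_matrix(xs, us, t, l)
        H_bar = H / np.linalg.norm(H)
        if np.linalg.matrix_rank(H_bar, tol=tol) == target_rank:
            z = np.linalg.svd(H_bar)[2][-1]       # kernel direction
            return l, z[:3] / z[:3].sum()         # normalize weights to sum to one
    return None, None

xs, us = optimal_trajectory([2.0, -2.0])
l_min, w_hat = minimal_length(xs, us, t=0, target_rank=3 + 2 - 1)
print(l_min, w_hat)
```

Under these assumptions the loop stops at a segment of four points and the normalized kernel vector matches the ground-truth weights, in line with Figure 2. With noisy data, the fixed-tolerance rank test would be replaced by the rank index check (42).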

5.1.2. Recovery matrix rank for additional observations

Based on the LQR system, we next show how additional observations affect the rank of the recovery matrix. Here, we vary the observation starting time t and use different candidate feature sets $F$ , and for each case, we incrementally increase the observation length from $l = 1$ while checking the rank of the recovery matrix until the rank reaches its maximum. The results are presented in Figure 3. For the first three cases Figure 3(a)–(c), we set the observation starting time at $t = 5$ , $t = 28$ , and $t = 30$ , respectively, and use a candidate feature set $F = {x_{1}^{2}, x_{2}^{2}, u^{2}, u^{3}}$ ; for the fourth case in Figure 3(d), we set the observation starting time at $t = 5$ and use a candidate feature set $F = {x_{1}^{2}, x_{2}^{2}, u^{2}, 2 u^{2}}$ . Based on the results, we have the following observations and comments.

Fig. 3.

The rank of the recovery matrix versus the observation length l. For (a), (b), and (c), the observation starting time is at $t = 5$ , $t = 28$ , and $t = 30$ , respectively, and the given candidate feature set is $F = {x_{1}^{2}, x_{2}^{2}, u^{2}, u^{3}}$ . For (d), the observation starting time is at $t = 5$ and the given candidate feature set is $F = {x_{1}^{2}, x_{2}^{2}, u^{2}, 2 u^{2}}$ . In (d), $F$ contains two dependent features, $u^{2}$ and $2 u^{2}$ , so multiple combinations of these features can be found in $F$ to characterize the optimal trajectory, namely ${x_{1}^{2}, x_{2}^{2}, u^{2}}$ and ${x_{1}^{2}, x_{2}^{2}, 2 u^{2}}$ ; by Lemma 3, $rank H (t, l) < r + n - 1 = 5$ , so the rank cannot reach five.

(1) From Figure 3(a)–(c), we can see that additional observation (i.e., increasing observation length l) increases or maintains the rank of the recovery matrix, as stated in Lemma 2, and that continuously increasing the observation length will lead to the upper bound of the recovery matrix’s rank, as stated in Lemma 3.

(2) Comparing Figure 3(a) with Figure 3(d), we see that although the number of candidate features is the same in both cases, i.e., $| F | = r = 4$ , their corresponding maximum ranks are different: the case in Figure 3(a) achieves $max rank H = 5 = r + n - 1$ (which is the rank condition (28) for a successful estimate), whereas in Figure 3(d) the rank reaches $max rank H = 4 < r + n - 1$ . This is because $F = {x_{1}^{2}, x_{2}^{2}, u^{2}, 2 u^{2}}$ used in Figure 3(d) contains two dependent features, i.e., $u^{2}$ and $2 u^{2}$ , so multiple combinations of features, e.g., ${x_{1}^{2}, x_{2}^{2}, u^{2}}$ and ${x_{1}^{2}, x_{2}^{2}, 2 u^{2}}$ , can be found in $F$ to characterize the optimal trajectory. Based on (34) in Lemma 3, $rank H (t, l) \leq 4$ for all t and l, and the condition $rank H (t, l) = 5 = r + n - 1$ for a successful recovery will never be fulfilled.

(3) Comparing Figures 3(a)–(c), we note that in some cases additional observations will not increase the rank of the recovery matrix, e.g., when the observation length is $l = 5, 6, 7, 8$ in Figure 3(b) and $l = 5, 6$ in Figure 3(c). This can be explained using the following relations:

\begin{matrix} rank H (t, l + 1) \\ = rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}] [\begin{matrix} I & 0 \\ \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} & \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}] \\ = rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}] \\ \geq rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \end{matrix}] = rank H (t, l) \end{matrix}

where the first two lines are directly from (30), and the last two lines are due to $det (\frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}}) = det (\begin{matrix} - 1 & 0 \\ 1 & 1 \end{matrix}) \neq 0$ and matrix rank properties. The above equation says that the new observation $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ is incorporated into the recovery matrix $H (t, l)$ in the form of appending m row vectors $[\frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \frac{\partial f^{'}}{\partial u_{t + l}^{*}}] \in R^{m \times (r + n)}$ to the bottom of $H (t, l)$ . If the new observed data point $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ is non-informative, in other words, if the appended rows in $[\frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \frac{\partial f^{'}}{\partial u_{t + l}^{*}}]$ are dependent on the row vectors in $H (t, l)$ , then according to the matrix rank properties, one will have $rank H (t, l) = rank H (t, l + 1)$ , thus the new data $ξ_{t + l + 1}$ will not increase the rank of the recovery matrix. Otherwise, if the appended rows in $[\frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \frac{\partial f^{'}}{\partial u_{t + l}^{*}}]$ are independent of the row vectors in $H (t, l)$ , that is, the new observed data $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ is informative, then this new $ξ_{t + l + 1}$ will increase the rank of the recovery matrix, i.e., $rank H (t, l) < rank H (t, l + 1)$ .
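
The row-appending argument can be illustrated on a toy matrix. The numbers below are arbitrary and only stand in for a recovery matrix: appending a row that is a combination of existing rows leaves the rank unchanged (a non-informative point), whereas appending an independent row raises it by one (an informative point).

```python
import numpy as np

H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])                  # toy stand-in with rank 2

dependent_row = H[0] + 2.0 * H[1]                # lies in the row space of H
independent_row = np.array([0.0, 0.0, 1.0])      # outside the row space of H

rank_before = np.linalg.matrix_rank(H)
rank_dependent = np.linalg.matrix_rank(np.vstack([H, dependent_row]))
rank_independent = np.linalg.matrix_rank(np.vstack([H, independent_row]))
print(rank_before, rank_dependent, rank_independent)   # 2 2 3
```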

5.1.3. Comparison with prior work

Here we demonstrate how the recovery matrix is able to solve IOC problems using incomplete observations. We show this by comparing with a recent inverse-KKT method developed by Englert et al. (2017). The idea of the inverse-KKT method is based on the optimality equations similar to (35) using full trajectory data $ξ$ . As suggested by Englert et al. (2017), the weights are estimated by minimizing

min_{ω} ∥ M (ξ) ω ∥^{2}

(53)

subject to $\sum_{i} ω_{i} = 1$ . Although the inverse-KKT method (Englert et al., 2017) is developed based on full trajectory data $ξ$ , we here want to see its performance when only incomplete data $ξ_{t : t + l} \subseteq ξ$ is given.

As analyzed in Section 3.3, the coefficient matrix $M (ξ)$ is a special case of the recovery matrix when $ξ_{t : t + l} = ξ_{0 : T}$ , that is,

M (ξ) = H_{1} (0, T)

(54)

Recall that this is because the LQR in (50)–(51) is a free-end optimal control system, as analyzed in Section 3.1, $λ_{T + 1} = 0$ . Given incomplete observation data $ξ_{t : t + l} \subseteq ξ$ , comparing the inverse-KKT method

min_{ω} ‖ M (ξ_{t : t + l}) ω ‖^{2} s . t . \sum_{i} ω_{i} = 1

(55)

with the proposed recovery matrix method

min_{ω, λ} ‖ H_{1} (t, l) ω + H_{2} (t, l) λ ‖^{2} s . t . \sum_{i} ω_{i} = 1

(56)

can show us how the unseen future data influences learning of the cost function.

For the LQR trajectory in Figure 1, we use the feature set $F = {x_{1}^{2}, x_{2}^{2}, u^{2}}$ , set the observation starting time t to be 0, 2, and 40, respectively, and for each observation starting time t, we increase the observation length l from 1 to the end of the trajectory, i.e., $t + l = T$ . With each observation $ξ_{t : t + l}$ , we solve the weight estimate using the inverse-KKT method (55) and the proposed method (56) and evaluate the estimation error $e_{ω}$ in (49), respectively. Results are shown in Figure 4, based on which we have the following comments.

Fig. 4.

Comparison between the inverse-KKT method (55) and proposed recovery matrix method (56) when given incomplete trajectory observation $ξ_{t : t + l}$ . Different observation starting time t is used: $t = 0$ in (a), $t = 2$ in (b), and $t = 40$ in (c). For each case, we increase the observation length l from 1 to the end of the horizon, i.e., $t + l = T$ , and for each l, the estimation error $e_{ω}$ for both methods is evaluated, respectively. Note that the estimation error is defined in (49).

(1) The inverse-KKT method is sensitive to the starting time of the observation sequence. When the observation starts from $t = 0$ (Figure 4(a)), the inverse-KKT method achieves a successful estimate for observation lengths $l \geq 30$ ; when the observation starts from $t = 2$ (Figure 4(b)) or $t = 40$ (Figure 4(c)), the inverse-KKT method obtains a successful estimate only when the observation reaches the end of the trajectory.

(2) As we have analyzed in Section 3.3, the success of the inverse-KKT method requires that the given data $ξ_{t : t + l}$ itself minimizes the cost function, which is only guaranteed when the observation reaches the trajectory end, i.e., $t + l = T$ . This explains the results in Figure 4(b) and (c). Given incomplete $ξ_{t : t + l}$ ( $t + l < T$ ), although the inverse-KKT method still achieves a successful estimate in Figure 4(a), such performance is not guaranteed and heavily relies on the “informativeness” of the given incomplete data relative to unseen future information. In Figure 1, because the trajectory data in the beginning phase is more “informative” than the rest, the inverse-KKT method starting from $t = 0$ uses less data to converge (Figure 4(a)) than starting from $t = 2$ (Figure 4(b)).

(3) In contrast, Figure 4 shows the effectiveness of using the recovery matrix to deal with incomplete observations. The proposed method guarantees a successful estimate after a much smaller observation length (e.g., around $l = 4$ for all three cases). This advantage is because the unseen future information is accounted for by $H_{2} (t, l)$ in the recovery matrix and the related unknown future variable $λ_{t + l + 1}$ is jointly estimated in (56).

In sum, we draw the following conclusions. First, existing KKT-based methods generally require a full trajectory, and cannot deal with incomplete trajectory data. Second, the proposed recovery matrix method addresses this by jointly accounting for unseen future information; and the recovery matrix presents a systematic way to check whether a trajectory segment is sufficient to recover the objective function and, if so, to solve it using only the segment data. Third, existing KKT-based methods can be viewed as a special case of the proposed recovery matrix method when the segment data is the full trajectory.
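
The role of the unseen-future term can be illustrated with a small self-contained sketch in the spirit of (55) versus (56). It is not the inverse-KKT implementation of Englert et al. (2017), nor the paper's recovery matrix recursion: the stacked matrix is built directly from the Pontryagin conditions of the LQR (50)-(51), its first three columns play the role of $H_{1}$ and its last two the role of $H_{2}$ , and the "future ignored" variant simply drops the costate columns. Function names and indexing (one row per observed input) are assumptions of this sketch.

```python
import numpy as np

A = np.array([[-1.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0], [3.0]])
W_TRUE = np.array([0.1, 0.3, 0.6])     # [q1, q2, r]
T = 50

def optimal_trajectory(x0):
    Q, R = np.diag(W_TRUE[:2]), W_TRUE[2]
    P, gains = np.zeros((2, 2)), [None] * (T + 1)
    for k in range(T, -1, -1):         # backward Riccati recursion, P_{T+1} = 0
        gains[k] = -(B.T @ P @ A) / (R + (B.T @ P @ B).item())
        P = Q + A.T @ P @ (A + B @ gains[k])
    xs, us = [np.asarray(x0, dtype=float)], []
    for k in range(T + 1):
        us.append((gains[k] @ xs[-1]).item())
        xs.append(A @ xs[-1] + B[:, 0] * us[-1])
    return np.array(xs), np.array(us)

def stacked_matrix(xs, us, t, l):
    """Columns 0-2 multiply [q1, q2, r]; columns 3-4 multiply the costate lambda_{t+l}."""
    end, rows = t + l - 1, []
    for k in range(t, end + 1):
        coef_q, AsB = np.zeros(2), B[:, 0].copy()
        for j in range(k + 1, end + 1):
            coef_q += 2.0 * AsB * xs[j]
            AsB = A @ AsB
        rows.append(np.concatenate([coef_q, [2.0 * us[k]], AsB]))
    return np.array(rows)

def solve_sum_to_one(M, num_w=3):
    """Least squares on M @ z with the first num_w entries of z summing to one."""
    n = M.shape[1]
    z0 = np.zeros(n)
    z0[num_w - 1] = 1.0                            # particular solution
    N = np.delete(np.eye(n), num_w - 1, axis=1)    # free directions
    N[num_w - 1, :num_w - 1] = -1.0                # keep the weight sum fixed
    y = np.linalg.lstsq(M @ N, -M @ z0, rcond=None)[0]
    return (z0 + N @ y)[:num_w]

def estimation_error(w_hat):
    c = max(w_hat @ W_TRUE / (w_hat @ w_hat), 0.0)
    return np.linalg.norm(c * w_hat - W_TRUE) / np.linalg.norm(W_TRUE)

xs, us = optimal_trajectory([2.0, -2.0])
errors = {}
for l in (4, T + 1):                               # short segment vs. full trajectory
    H = stacked_matrix(xs, us, 0, l)
    errors[l] = {
        "future_ignored": estimation_error(solve_sum_to_one(H[:, :3])),
        "joint": estimation_error(solve_sum_to_one(H)),
    }
    print(l, errors[l])
```

On the short segment only the joint estimate is accurate, whereas on the full trajectory both variants agree because the terminal costate vanishes in the free-end setting.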

5.1.4. IOC for Infinite-horizon LQR

We demonstrate the ability of the proposed method to solve the IOC problem for an infinite-horizon control system. We still use the LQR system in (50)–(51) as an example, but here we set the time horizon $T = \infty$ (other conditions and parameters remain the same). The optimal trajectory in this case is a result of feedback control $u = K x$ with a constant control gain K solved by the algebraic Riccati equation (Bertsekas, 1995). For the above infinite-horizon LQR (with Q and R in (52)), the control gain is solved as $K = [- 0.1472 - 0.1918]$ .
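
The stationary gain can be checked numerically. The sketch below iterates the discrete-time Riccati recursion to a fixed point instead of calling a dedicated algebraic Riccati solver; this is a choice made here for self-containment, and the iteration count is an arbitrary, generous value.

```python
import numpy as np

A = np.array([[-1.0, 1.0], [0.0, 1.0]])    # dynamics (50)
B = np.array([[1.0], [3.0]])
Q = np.diag([0.1, 0.3])                    # weights q1, q2
R = 0.6                                    # weight r

P = Q.copy()
for _ in range(2000):                      # Riccati value iteration to a fixed point
    K = -(B.T @ P @ A) / (R + (B.T @ P @ B).item())
    P = Q + A.T @ P @ (A + B @ K)

K = -(B.T @ P @ A) / (R + (B.T @ P @ B).item())          # stationary gain, u = K x
spectral_radius = max(abs(np.linalg.eigvals(A + B @ K)))  # closed-loop stability check
print(K, spectral_radius)
```

The resulting gain can be compared with the value reported above, and a spectral radius below one confirms that the closed loop is stable.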

In IOC, suppose that we observe an arbitrary segment from the infinite-horizon trajectory; here we use the segment data within the time interval $[t, t + l] = [8, 33]$ , namely, $ξ_{8 : 33}$ with $t = 8$ and $l = 25$ . We set the candidate feature set $F = {x_{1}^{2}, x_{2}^{2}, u^{2}}$ . The IOC results using Algorithm 1 are presented in Figure 5. Here we fix the observation starting time $t = 8$ while increasing l from 1 to 25, the end of the interval. The upper panel of Figure 5 shows $rank H (8, l)$ versus increasing observation length l, and the bottom panel shows the weight estimate $\hat{ω}$ for each l solved from the kernel of the recovery matrix $H (8, l)$ . As shown in the upper panel, with the observation length l increasing, $rank H (8, l)$ quickly reaches the upper bound rank $r + n - 1 = 4$ for $l \geq 4$ , indicating the successful estimate of the weights as shown in the bottom panel. The results demonstrate the ability of the proposed method to solve IOC problems for infinite-horizon optimal control systems.

Fig. 5.

IOC results for infinite-horizon LQR system. The observation starting time is $t = 8$ and the observation length l increases from $l = 1$ to 25. The upper panel shows the rank of the recovery matrix versus increasing l, and the bottom panel is the corresponding weight estimate for each l.

5.2. Evaluation on a two-link robot arm

To evaluate the proposed method on a non-linear plant, we use a two-link robot arm system, as shown in Figure 6. The dynamics of the two-link arm (Spong and Vidyasagar, 2008: p. 209) moving in the vertical plane is

M (θ) \overset{\cdot\cdot}{θ} + C (θ, \overset{\cdot}{θ}) \overset{\cdot}{θ} + g (θ) = \bar{τ}

(57)

where $θ = [θ_{1}, θ_{2}]^{'} \in R^{2}$ is the joint angle vector; $M (θ) \in R^{2 \times 2}$ is the positive-definite inertia matrix; $C (θ, \overset{\cdot}{θ}) \in R^{2 \times 2}$ is the Coriolis matrix; $g (θ) \in R^{2}$ is the gravity vector; and $\bar{τ} = [τ_{1}, τ_{2}]^{'} \in R^{2}$ are the input torques applied to each joint. The parameters of the two-link robot arm in Figure 6 are as follows. The mass of each link is $m_{1} = 1 kg$ , $m_{2} = 1 kg$ ; the length of each link is $l_{1} = 1 m$ , $l_{2} = 1 m$ ; the distance from the joint to the center of mass for each link is $r_{1} = 0.5 m$ , $r_{2} = 0.5 m$ ; and the moment of inertia with respect to the center of mass for each link is $I_{1} = 0.5 kg m^{2}$ , $I_{2} = 0.5 kg m^{2}$ . From (57), we have

\overset{\cdot\cdot}{θ} = M (θ)^{- 1} (- C (θ, \overset{\cdot}{θ}) \overset{\cdot}{θ} - g (θ) + \bar{τ})

(58)

which can be further expressed in state-space representation

\overset{\cdot}{x} = f (x, u)

(59)

with the system state and input defined as

x = {[\begin{matrix} θ_{1} & θ_{2} & {\overset{\cdot}{θ}}_{1} & {\overset{\cdot}{θ}}_{2} \end{matrix}]}^{'}, u = {[\begin{matrix} τ_{1} & τ_{2} \end{matrix}]}^{'}

(60)

respectively. We consider the following finite-horizon fixed-end optimal control problem for the above robot arm system:

\begin{matrix} \min_{x_{1 : T}} & \sum_{k = 0}^{T} ω^{'} ϕ^{*} (x_{k}, u_{k}) \\ \text{s.t.} & x_{k + 1} = x_{k} + Δ f (x_{k}, u_{k}) \\ & x_{0} = x_{start} \\ & x_{T + 1} = x_{goal} \end{matrix}

(61)

where $Δ = 0.01 s$ is the discretization interval. In (61), we specify the initial state $x_{start} = [0, 0, 0, 0]^{'}$ , goal state $x_{goal} = [\frac{π}{2}, - \frac{π}{2}, 0, 0]^{'}$ , the time horizon $T = 100$ , and the feature vector and the corresponding weights

ϕ^{*} = {[\begin{matrix} τ_{1}^{2} & τ_{2}^{2} & τ_{1} τ_{2} \end{matrix}]}^{'}, \quad ω = {[\begin{matrix} 0.6 & 0.3 & 0.1 \end{matrix}]}^{'}

(62)

respectively. We solve the above optimal control system (61) using the CasADi software (Andersson et al., 2019) and plot the resulting trajectory in Figure 7.
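The plant model (57)–(60) and the Euler step $x_{k + 1} = x_{k} + Δ f (x_{k}, u_{k})$ in (61) can be sketched as follows. The expressions for $M (θ)$ , $C (θ, \overset{\cdot}{θ})$ , and $g (θ)$ are the standard textbook form for a planar two-link arm and are an assumption here: they take $θ_{1}$ to be measured from the horizontal axis and $θ_{2}$ relative to the first link, which may differ from the coordinate definitions in Figure 6. The gravitational acceleration of 9.81 m/s² and all function names are also assumptions. The length $l_{2}$ does not enter these expressions.

```python
import numpy as np

# Parameters as listed in the text.
M1, M2 = 1.0, 1.0      # link masses [kg]
L1 = 1.0               # length of link 1 [m]
R1, R2 = 0.5, 0.5      # joint-to-center-of-mass distances [m]
I1, I2 = 0.5, 0.5      # moments of inertia about the centers of mass [kg m^2]
GRAV = 9.81            # assumed gravitational acceleration [m/s^2]
DT = 0.01              # discretization interval Delta [s]

def inertia(theta):
    """Inertia matrix M(theta) of the assumed planar two-link model."""
    c2 = np.cos(theta[1])
    m11 = M1 * R1**2 + M2 * (L1**2 + R2**2 + 2 * L1 * R2 * c2) + I1 + I2
    m12 = M2 * (R2**2 + L1 * R2 * c2) + I2
    m22 = M2 * R2**2 + I2
    return np.array([[m11, m12], [m12, m22]])

def coriolis(theta, dtheta):
    """Coriolis matrix C(theta, dtheta) built from the Christoffel symbols."""
    h = -M2 * L1 * R2 * np.sin(theta[1])
    return np.array([[h * dtheta[1], h * (dtheta[0] + dtheta[1])],
                     [-h * dtheta[0], 0.0]])

def gravity(theta):
    """Gravity vector g(theta), angles measured from the horizontal."""
    c1, c12 = np.cos(theta[0]), np.cos(theta[0] + theta[1])
    return GRAV * np.array([(M1 * R1 + M2 * L1) * c1 + M2 * R2 * c12,
                            M2 * R2 * c12])

def f(x, u):
    """Continuous-time dynamics (59) with x = [theta, dtheta], u = torques."""
    theta, dtheta = x[:2], x[2:]
    rhs = -coriolis(theta, dtheta) @ dtheta - gravity(theta) + u
    ddtheta = np.linalg.solve(inertia(theta), rhs)   # equation (58)
    return np.concatenate([dtheta, ddtheta])

def step(x, u):
    """One Euler step of the discretized dynamics used in (61)."""
    return x + DT * f(x, u)

# Holding the arm at rest requires torques equal to the gravity vector.
x_start = np.zeros(4)
u_hold = gravity(x_start[:2])
x_next = step(x_start, u_hold)
```

Using `np.linalg.solve` instead of forming $M (θ)^{- 1}$ explicitly gives the same joint accelerations as (58) while avoiding a matrix inverse.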

Fig. 6.

Two-link robot arm with coordinate definitions.

Fig. 7.

The optimal trajectory of the two-link robot arm optimal control system (61) and (62).

5.2.1. Minimal required observations for IOC

Based on the described robot arm system, we first show how the recovery matrix can be used to check whether an incomplete trajectory observation contains the minimal observation required for successful IOC. As an example, in Figure 7, we set the observation starting time at $t = 50$. While increasing the observation length l from 1, we check the rank of $H (50, l)$, solve for the weight estimate $\hat{ω}$ from the kernel of $H (50, l)$, and evaluate the estimation error $e_{ω}$ in (49) for $\hat{ω}$. This process is repeated for three candidate feature sets: $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}\}$, $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1}^{2} τ_{2}\}$, and $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{4}, τ_{2}^{4}, τ_{1}^{3} τ_{2}, τ_{1}^{2} τ_{2}^{2}\}$. The results are shown in Figure 8.

Fig. 8.

The rank of the recovery matrix and the corresponding estimation error $e_{ω}$ for different observation lengths l and different candidate feature sets. The upper panel shows the rank of the recovery matrix versus l, and the bottom panel shows the estimation error $e_{ω}$ for each l. Three feature sets, $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}\}$, $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1}^{2} τ_{2}\}$, and $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{4}, τ_{2}^{4}, τ_{1}^{3} τ_{2}, τ_{1}^{2} τ_{2}^{2}\}$, are used, and the corresponding results are plotted as separate lines. Note that when $rank H (50, l) < | F | + n - 1$, the kernel of $H (50, l)$ has dimension at least 2, so we choose $\hat{ω}$ from the kernel randomly.

Results in the upper panel of Figure 8 show that additional observations increase the rank of the recovery matrix. Once the additional observations bring the rank of the recovery matrix to its upper bound, i.e., $| F | + n - 1$, the corresponding length is the minimal observation length $l_{\min} (t)$ required for a successful weight estimate, as shown in the bottom panel of Figure 8. Moreover, including additional irrelevant features in $F$ increases the minimal required observation length $l_{\min} (t)$, as implied by (46). This is discussed later in Section 5.2.3.
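A simple row count gives a lower bound that makes this dependence on $| F |$ explicit. The bound below is a sketch under one assumption: writing m for the input dimension (a symbol introduced here), each observed step contributes m rows, so that $H (t, l)$ has $l m$ rows and $| F | + n$ columns. Because the rank cannot exceed the number of rows, the upper-bound rank is reachable only when $l m \geq | F | + n - 1$. This is a necessary condition, not the exact value of $l_{\min} (t)$.

```latex
l \, m \;\geq\; |F| + n - 1
\quad\Longrightarrow\quad
l_{\min}(t) \;\geq\; \left\lceil \frac{|F| + n - 1}{m} \right\rceil ,
\qquad
n = 4,\; m = 2:\quad
|F| = 3 \mapsto 3, \quad |F| = 6 \mapsto 5, \quad |F| = 10 \mapsto 7 .
```

The same count gives 4 for the LQR example of Figure 5 (taking $n = 2$, $m = 1$, $| F | = 3$), which agrees with the rank reaching its upper bound once $l \geq 4$.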

5.2.2. Observation noise

We test the proposed incremental IOC approach (Algorithm 1) under different data noise levels. We add Gaussian noise to the trajectory in Figure 7 (both states and inputs), with standard deviations ranging from $σ = 10^{- 5}$ to $σ = 10^{- 1}$. In Algorithm 1, we use $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}\}$ and set $γ = 45$ (the choice of $γ$ is discussed later in Section 5.2.4).

We set the observation starting time t at all time instants except for those near the trajectory end, which cannot provide sufficient subsequent observation length. As an example, we present the experimental results for the noise level $σ = 10^{- 2}$ in Figure 9. Here, the upper panel shows the minimal required observation length $l_{\min} (t)$ automatically found for each observation starting time t, and the bottom panel shows the corresponding weight estimate using the minimal required observation data $ξ_{t : t + l_{\min} (t)}$.

Fig. 9.

IOC by automatically finding the minimal required observation under noise level $σ = 10^{- 2}$. The x-axis is the observation starting time t. The upper panel shows the automatically found minimal required observation length $l_{\min} (t)$ at each t, and the bottom panel shows the corresponding estimate $\hat{ω}$ via (47). Note that the ground truth is $ω = [0.6, 0.3, 0.1]^{'}$.

From Figure 9, we see that the automatically found minimal required observation length $l_{\min} (t)$ varies with the observation starting time t. This can be interpreted by noting that different intervals of the trajectory data in Figure 7 are not equally informative about the cost function. For example, according to Figure 9, we can postulate that the beginning and final portions of the trajectory data are more “data-informative” than other portions, thus needing a smaller $l_{\min} (t)$ to achieve a successful estimate. This can be understood if we consider that the beginning and final portions of the trajectory in Figure 7 have richer patterns, such as curvature, than the middle portion, which is smoother. Using the recovery matrix, the informativeness of the data about the cost function is quantitatively indicated by the recovery matrix’s rank. Even under observation noise, the proposed method can adaptively find an observation length such that the data is informative enough for a successful estimate of the weights, as shown by both the upper and bottom panels in Figure 9.

We summarize all results under different noise levels in Table 1. Here the minimal required observation length is presented as a percentage of the total horizon T. According to Table 1, under a fixed rank index threshold (here $γ = 45$), higher noise levels lead, on average, to a larger minimal required observation length, while the averaged estimation error stays below $10^{- 2}$ at every noise level. This is because the increased observation length can compensate for the uncertainty induced by data noise and finally produces a “neutralized” estimate. Hence, the results indicate that the proposed incremental IOC algorithm is robust against small observation noise. We show later how to further improve the accuracy by adjusting $γ$.

Table 1.

Results of incremental IOC (Algorithm 1, $γ = 45$ ) under different noise levels.

Noise level $σ$	Averaged $l_{\min} / T$ (%)^†	Averaged $e_{ω}$ ^†
$σ = 10^{- 5}$	$8\%$	$4.3 \times 10^{- 4}$
$σ = 10^{- 4}$	$8.1\%$	$4.0 \times 10^{- 3}$
$σ = 10^{- 3}$	$12.61\%$	$8.1 \times 10^{- 3}$
$σ = 10^{- 2}$	$33.8\%$	$8.5 \times 10^{- 3}$
$σ = 10^{- 1}$	$70.0\%$	$7.1 \times 10^{- 3}$

† The average is calculated based on all successful estimations over all observation cases (varying observation starting time).

5.2.3. Presence of irrelevant features

We here assume that exact knowledge of the relevant features is not available, and we evaluate the performance of Algorithm 1 given a feature set that includes irrelevant features. We add Gaussian noise of $σ = 10^{- 3}$ to all observation data. In Algorithm 1, we set $γ = 45$ and construct a feature set $F$ from the following candidate features

\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{2} τ_{2}, τ_{1}^{4}, τ_{2}^{4}, τ_{1}^{3} τ_{2}, τ_{1} τ_{2}^{3}, τ_{1}^{2} τ_{2}^{2}\}

(63)

Algorithm 1 is applied in the same way as in the previous experiment: by starting the observation at all time instants except for those near the trajectory end. The candidate feature sets are listed in the first column of Table 2, and for each set we compute the average of the minimal required observation length and the average of the estimation error in (49). The results are summarized in the second and third columns of Table 2.

Table 2.

Results of incremental IOC (Algorithm 1, $γ = 45$ ) with different candidate feature sets.

Candidate feature set $F$	Averaged $l_{\min} / T$ (%)^†	Averaged $e_{ω}$ ^†
$\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}\}$	$12.18\%$	$4.2 \times 10^{- 3}$
$\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}\}$	$14.7\%$	$9.7 \times 10^{- 3}$
$\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{2} τ_{2}\}$	$25.69\%$	$8.7 \times 10^{- 3}$
$\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{2} τ_{2}, τ_{1}^{4}, τ_{2}^{4}\}$	$35.97\%$	$8.6 \times 10^{- 3}$
$\{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}, τ_{1}^{3}, τ_{2}^{3}, τ_{1} τ_{2}^{2}, τ_{1}^{2} τ_{2}, τ_{1}^{4}, τ_{2}^{4}, τ_{1}^{3} τ_{2}, τ_{1} τ_{2}^{3}, τ_{1}^{2} τ_{2}^{2}\}$	$45.53\%$	$9.1 \times 10^{- 3}$

† The average is calculated based on all successful estimations over all observation cases (varying observation starting time).

Table 2 indicates that, on average, the minimal required observation length increases as additional irrelevant features are added to the feature set $F$. This can be understood if we consider (45) and the rank non-decreasing property in Lemma 2: when a certain number of irrelevant features are added, the rank required for a successful estimate increases by the same amount, so additional trajectory data points are needed. Owing to the increased observation length, the estimation accuracy is not much influenced by the additional irrelevant features: the averaged $e_{ω}$ stays below $10^{- 2}$ for every feature set in Table 2. Thus, we conclude that the proposed incremental IOC algorithm remains applicable in the presence of irrelevant features.

5.2.4. Parameter setting

We now discuss how to choose the rank threshold $γ$ in Algorithm 1. Because Algorithm 1 uses the rank index (41) of the recovery matrix to find the minimal required observation length, we first investigate how the rank index $κ (t, l)$ changes as the observation length l increases. We use the trajectory data in Figure 7 with added Gaussian noise of $σ = 10^{- 3}$, $σ = 2 \times 10^{- 3}$, and $σ = 10^{- 2}$, respectively. The candidate feature set here is $F = \{τ_{1}^{2}, τ_{2}^{2}, τ_{1} τ_{2}\}$. We fix the observation starting time $t = 0$ and increase the observation length l from 1 to T. The rank index $κ (t = 0, l)$ for different l is shown in Figure 10.

Fig. 10.

The rank index $κ (0, l)$ in (41) versus different observation length l under different noise levels.

From Figure 10, we can see that although $κ (t, l)$ has different scales at different noise levels, it generally increases as the observation length l increases. This can be understood if we compare our results with $κ (t, l)$ in noise-free cases: when there is no data noise, according to Lemmas 2 and 3, as l increases, $κ (t, l)$ first remains zero while $l < l_{\min}$, then increases to infinity once $l \geq l_{\min}$. In noisy settings, however, $κ (t, l)$ increases to a large finite value. From the plot, we can postulate that in practice choosing a larger threshold $γ$ leads to a larger minimal required observation length $l_{\min}$, so more data points are included in the recovery matrix to compute the estimate of the weights, which may finally improve the estimation accuracy (similar to the results in Table 1). In what follows, we verify this postulation by showing how $γ$ affects the performance of Algorithm 1.
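The thresholding idea can be illustrated with a small sketch. The definition of the rank index in (41) is not reproduced in this section, so the function below uses an assumed stand-in: the ratio of the second-smallest to the smallest singular value of the matrix, set to zero when the kernel has dimension at least two. The helper names and the list of index values passed to `minimal_length` are hypothetical and chosen only to show the effect of $γ$ .

```python
import numpy as np

def rank_index(H, rel_tol=1e-12):
    """Assumed stand-in for kappa(t, l): sigma_2 / sigma_1 of H.

    sigma_1 and sigma_2 are the smallest and second-smallest singular values,
    counted over all columns of H. Returns 0.0 when the kernel has dimension
    at least two and inf when the smallest singular value is exactly zero.
    """
    s = np.linalg.svd(H, compute_uv=False)           # descending order
    pad = max(0, H.shape[1] - s.size)                # rows < columns: pad zeros
    s = np.concatenate([s, np.zeros(pad)])
    if s[-2] <= rel_tol * s[0]:
        return 0.0
    if s[-1] == 0.0:
        return float("inf")
    return float(s[-2] / s[-1])

def minimal_length(kappas, gamma):
    """Smallest l (1-based) whose index reaches gamma, or None if none does."""
    for l, kappa in enumerate(kappas, start=1):
        if kappa >= gamma:
            return l
    return None

def with_singular_values(s, rows=6, seed=0):
    """Build a test matrix with prescribed singular values s."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((rows, len(s))))
    V, _ = np.linalg.qr(rng.standard_normal((len(s), len(s))))
    return U @ np.diag(s) @ V.T

# Nearly rank-deficient by one: finite index, as in the noisy setting.
kappa_noisy = rank_index(with_singular_values([3.0, 2.0, 1.0, 0.5, 0.01]))
# Kernel of dimension two: index zero, as for l < l_min without noise.
kappa_deficient = rank_index(with_singular_values([3.0, 2.0, 1.0, 0.0, 0.0]))

# Hypothetical index values for l = 1, ..., 5; a larger gamma selects a larger l.
kappas = [0.0, 0.0, 12.0, 60.0, 300.0]
l_gamma_45 = minimal_length(kappas, gamma=45)
l_gamma_200 = minimal_length(kappas, gamma=200)
```

With these illustrative values, raising the threshold from 45 to 200 moves the selected length from the fourth to the fifth entry, which mirrors the qualitative trend postulated from Figure 10.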

We add Gaussian noise of $σ = 10^{- 3}$ to the trajectory data in Figure 7, and apply Algorithm 1 by starting the observation at all possible time steps, as in the previous experiments. We vary $γ$ to show its influence on the average of the minimal required observation length $l_{\min}$ and the average of the estimation error $e_{ω}$. The results are shown in Figure 11, from which we observe that, first, a larger $γ$ leads to a larger minimal required observation length; and, second, owing to the increased minimal required observation, the corresponding estimation accuracy improves because data noise or other error sources can be compensated by additional observation data. These observations support our previous postulation based on Figure 10. Moreover, Figure 11 also shows that once $γ$ exceeds a certain value, e.g., 200, further increasing $γ$ does not improve the recovery accuracy significantly. This suggests that the performance is not sensitive to the choice of $γ$ when $γ$ is large. Therefore, in practice it is possible to find a suitable $γ$ without much manual effort such that estimation accuracy and computational cost are balanced.

Fig. 11.

Averaged $l_{\min}$ (upper panel) and averaged estimation error $e_{ω}$ (bottom panel) for different choices of $γ$ .

6. Conclusions

This article has considered the problem of learning an objective function from an observation of an incomplete trajectory. To achieve this goal, we have developed the recovery matrix, which establishes a relationship between trajectory segment data and the unknown weights of given candidate features. The rank of the recovery matrix indicates whether an incomplete trajectory observation is sufficient for obtaining a successful estimate of the weights. By investigating the properties of the recovery matrix, we have further demonstrated that additional observations may increase the rank of the recovery matrix, thereby enabling successful estimation, and that the IOC can be processed incrementally. Based on the recovery matrix, a method for using incomplete trajectory observations to estimate the weights of specified features has been established, and an incremental IOC algorithm has been developed by automatically finding the minimal required observation.

Appendix A. Proof of Lemma 1

Consider the recovery matrix $H (t, l)$ for the trajectory segment $ξ_{t : t + l} = (x_{t : t + l}^{*}, u_{t : t + l}^{*})$, with t being the observation starting time and l being the observation length. When a subsequent point $ξ_{t + l + 1} = (x_{t + l + 1}^{*}, u_{t + l + 1}^{*})$ is observed, from Definition 1, the updated recovery matrix is $H (t, l + 1) = [H_{1} (t, l + 1), H_{2} (t, l + 1)]$, where

(64)

H_{1} (t, l + 1) = F_{u} (t, l + 1) F_{x}^{- 1} (t, l + 1) Φ_{x} (t, l + 1) + Φ_{u} (t, l + 1)

and

(65)

H_{2} (t, l + 1) = F_{u} (t, l + 1) F_{x}^{- 1} (t, l + 1) V (t, l + 1) .

Here $F_{x} (t, l + 1)$ , $F_{u} (t, l + 1)$ , $Φ_{x} (t, l + 1)$ , and $Φ_{u} (t, l + 1)$ , defined in (9)–(12), are updated as follows:

(66a)

Φ_{u} (t, l + 1) = [\begin{matrix} Φ_{u} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \end{matrix}]

(66b)

Φ_{x} (t, l + 1) = [\begin{matrix} Φ_{x} (t, l) \\ \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}]

(66c)

F_{u} (t, l + 1) = [\begin{matrix} F_{u} (t, l) & 0 \\ 0 & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}]

(66d)

\begin{matrix} F_{x}^{- 1} (t, l + 1) = {[\begin{matrix} F_{x} (t, l) & - V (t, l) \\ 0 & I \end{matrix}]}^{- 1} \\ = [\begin{matrix} F_{x}^{- 1} (t, l) & F_{x}^{- 1} (t, l) V (t, l) \\ 0 & I \end{matrix}] \end{matrix}

respectively. Here (66d) is based on the fact

{[\begin{matrix} A & B \\ C & D \end{matrix}]}^{- 1} = [\begin{matrix} A^{- 1} + A^{- 1} B K^{- 1} C A^{- 1} & - A^{- 1} B K^{- 1} \\ - K^{- 1} C A^{- 1} & K^{- 1} \end{matrix}]

with $K = D - C A^{- 1} B$ being the Schur complement of the above block matrix with respect to A. Combining (66a)–(66d), we have

(67)

\begin{matrix} \begin{matrix} H_{1} (t, l + 1) = F_{u} (t, l + 1) F_{x}^{- 1} (t, l + 1) Φ_{x} (t, l + 1) \\ + Φ_{u} (t, l + 1) \end{matrix} \\ \begin{matrix} = [\begin{matrix} F_{u} (t, l) F_{x}^{- 1} (t, l) Φ_{x} (t, l) + Φ_{u} (t, l) \\ \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} + \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \end{matrix}] \\ + [\begin{matrix} F_{u} (t, l) F_{x}^{- 1} (t, l) V (t, l) \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} \\ 0 \end{matrix}] \end{matrix} \end{matrix}

Combining with (7)–(8), the above (67) becomes

(68)

H_{1} (t, l + 1) = [\begin{matrix} H_{1} (t, l) + H_{2} (t, l) \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} \\ \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} + \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} \end{matrix}]

Considering (66a)–(66d), we have

(69)

\begin{matrix} H_{2} (t, l + 1) = F_{u} (t, l + 1) F_{x}^{- 1} (t, l + 1) V (t, l + 1) \\ = [\begin{matrix} F_{u} (t, l) F_{x}^{- 1} (t, l) V (t, l) \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \\ \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}] \\ = [\begin{matrix} H_{2} (t, l) \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \\ \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}] \end{matrix}

Finally, joining (68) and (69) and writing them in matrix form leads to (30).

When $l = 1$ , that is, $ξ_{t : t + 1} = (x_{t : t + 1}^{*}, u_{t : t + 1}^{*})$ is available, we have $F_{x} (t, 1) = I$ , $F_{u} (t, 1) = \frac{\partial f^{'}}{\partial u_{t}^{*}}$ , $Φ_{x} (t, 1) = \frac{\partial ϕ^{'}}{\partial x_{t + 1}^{*}}$ , $Φ_{u} (t, 1) = \frac{\partial ϕ^{'}}{\partial u_{t}^{*}}$ , and $V (t, 1) = \frac{\partial f^{'}}{\partial x_{t + 1}^{*}}$ . According to the definition of recovery matrix in (6)–(8), we thus obtain (31). This completes the proof.
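The recursion in (68) and (69), started from the $l = 1$ case above, can be checked numerically. The sketch below is not one of the article's experiments: it uses a hypothetical discrete-time LQR instance with features $[x_{1}^{2}, x_{2}^{2}, u^{2}]$ , generates an optimal trajectory by a backward Riccati recursion, builds the recovery matrix step by step, and verifies that the stacked vector of the true weights and the costate (here $2 P_{k} x_{k}$ , which is specific to this LQ setting) lies in its kernel. All numerical values and names are assumptions.

```python
import numpy as np

# Hypothetical LQR instance (not the system used in the article).
A = np.array([[0.9, 0.5], [-0.5, 0.8]])
B = np.array([[0.0], [0.5]])
omega = np.array([0.6, 0.3, 0.1])            # weights of [x1^2, x2^2, u^2]
Q, R = np.diag(omega[:2]), omega[2:].reshape(1, 1)
T = 40

# Backward Riccati recursion with terminal cost x_T' Q x_T.
P, K = [None] * (T + 1), [None] * T
P[T] = Q
for k in range(T - 1, -1, -1):
    K[k] = np.linalg.solve(R + B.T @ P[k + 1] @ B, B.T @ P[k + 1] @ A)
    P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])

# Forward rollout of the optimal trajectory.
x, u = [np.array([2.0, -1.0])], []
for k in range(T):
    u.append(-K[k] @ x[k])
    x.append(A @ x[k] + B @ u[k])

def dphi_dx_T(xk):
    """Transposed Jacobian of the features w.r.t. the state, shape (n, r)."""
    return np.array([[2 * xk[0], 0.0, 0.0], [0.0, 2 * xk[1], 0.0]])

def dphi_du_T(uk):
    """Transposed Jacobian of the features w.r.t. the input, shape (m, r)."""
    return np.array([[0.0, 0.0, 2 * uk[0]]])

def recovery_matrix(t, l):
    """Build [H1, H2] for observation start t and length l recursively."""
    H1, H2 = np.zeros((0, 3)), np.zeros((0, 2))
    for j in range(l):   # step j appends the block of u_{t+j} and x_{t+j+1}
        px, pu = dphi_dx_T(x[t + j + 1]), dphi_du_T(u[t + j])
        H1 = np.vstack([H1 + H2 @ px, B.T @ px + pu])   # update (68)
        H2 = np.vstack([H2 @ A.T, B.T @ A.T])           # update (69)
    return np.hstack([H1, H2])

def numerical_rank(H, rel_tol=1e-9):
    s = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

t, l_max = 3, 12
ranks = [numerical_rank(recovery_matrix(t, l)) for l in range(1, l_max + 1)]
H = recovery_matrix(t, l_max)

# True weights stacked with the costate at time t + l + 1.
z_true = np.concatenate([omega, 2 * P[t + l_max + 1] @ x[t + l_max + 1]])
residual = np.linalg.norm(H @ z_true)

# Weight estimate from the kernel, scaled to sum to one (assumed normalization).
z_hat = np.linalg.svd(H)[2][-1]
omega_hat = z_hat[:3] / z_hat[:3].sum()
```

Starting the loop from empty matrices reproduces the $l = 1$ expressions in the first iteration, so no separate base case is needed in code.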

Appendix B. Proof of Lemma 2

From Lemma 1, we have

(70)

\begin{matrix} rank H (t, l + 1) \\ = rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}] [\begin{matrix} I & 0 \\ \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} & \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{matrix}] \end{matrix}

If $det (\frac{\partial f}{\partial x_{t + l + 1}^{*}}) \neq 0$ , the last block matrix in (70) is non-singular. Consequently,

(71)

\begin{matrix} rank H (t, l + 1) = rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \\ \frac{\partial ϕ^{'}}{\partial u_{t + l}^{*}} & \frac{\partial f^{'}}{\partial u_{t + l}^{*}} \end{matrix}] \\ \geq rank [\begin{matrix} H_{1} (t, l) & H_{2} (t, l) \end{matrix}] \\ = rank H (t, l) \end{matrix}

Note that both (70) and the inequality (71) are independent of the choice of $ϕ$ . This completes the proof.
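The step from (70) to (71) relies on two standard facts, written out here for completeness: the second factor in (70) is block lower-triangular, so its determinant is the product of the determinants of its diagonal blocks, and right-multiplication by a non-singular matrix preserves rank. The symbols X and Y below are generic placeholders introduced only for this statement.

```latex
\det \begin{bmatrix} I & 0 \\ \frac{\partial ϕ^{'}}{\partial x_{t + l + 1}^{*}} & \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \end{bmatrix}
= \det (I) \, \det \left( \frac{\partial f^{'}}{\partial x_{t + l + 1}^{*}} \right)
= \det \left( \frac{\partial f}{\partial x_{t + l + 1}^{*}} \right) \neq 0 ,
\qquad
rank (X Y) = rank \, X \quad \text{for non-singular } Y .
```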

Appendix C. Proof of Lemma 3

We first prove (33). Without loss of generality, we consider the feature set in (14). For any trajectory segment $ξ_{t : t + l} \subseteq ξ$, from (27), we know that there exists a costate $λ_{t + l + 1}^{*}$ such that

(72)

H (t, l) [\begin{matrix} \bar{ω} \\ λ_{t + l + 1}^{*} \end{matrix}] = 0

holds, where $\bar{ω} \neq 0$ is defined in (16). Thus, the nullity of $H (t, l)$ is at least one. Because $H (t, l)$ has $r + n$ columns, this means

rank H (t, l) \leq r + n - 1

We then prove (34). When another relevant feature subset $\tilde{F}$ exists in $F$ with associated weight vector $\tilde{ω}$, we can similarly construct a weight vector $\overset{⌣}{ω}$ corresponding to $col F$ as in (16); that is, the weights in $\overset{⌣}{ω}$ that correspond to $\tilde{F}$ are taken from $\tilde{ω}$ and all others are zero. Then, following derivations similar to (17)–(27), we obtain that there exists ${\overset{⌣}{λ}}_{t + l + 1} \in R^{n}$ such that

(73)

H (t, l) [\begin{matrix} \overset{⌣}{ω} \\ {\overset{⌣}{λ}}_{t + l + 1} \end{matrix}] = 0

As $\tilde{F} \neq F^{*}$ or $\tilde{ω} \neq c_{1} ω$ implies $\overset{⌣}{ω} \neq c_{2} \bar{ω}$ ( $c_{1}$ and $c_{2}$ are some non-zero scalars), based on (72) and (73), it follows that the nullity of $H (t, l)$ is at least two, i.e.,

rank H (t, l) \leq r + n - 2

This completes the proof.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research is partly supported by the ERC Consolidator Grant Safe data-driven control for human-centric systems under grant agreement 864686 at Chair of Information-oriented Control, Technical University of Munich.

ORCID iD

Wanxin Jin

References

Abbeel and Ng (2004) Apprenticeship learning via inverse reinforcement learning. In: International Conference on Machine Learning.

Abbeel, Coates and Ng (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29(13): 1608–1639.

Aghasadeghi and Bretl (2014) Inverse optimal control for differentially flat systems with application to locomotion modeling. In: IEEE International Conference on Robotics and Automation, pp. 6018–6025.

Andersson JAE, Gillis, Horn, Rawlings and Diehl (2019) CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation 11(1): 1–36.

Bajcsy, Losey, O’Malley and Dragan (2018) Learning from physical human corrections, one feature at a time. In: ACM/IEEE International Conference on Human-Robot Interaction, pp. 141–149.

Bajcsy, Losey, O’Malley and Dragan (2017) Learning robot objectives from physical human interaction. Proceedings of Machine Learning Research 78: 217–226.

Berret, Chiovetto, Nori and Pozzo (2011) Evidence for composite cost functions in arm movement planning: An inverse optimal control approach. PLoS Computational Biology 7(10): e1002183.

Bertsekas (1995) Dynamic Programming and Optimal Control, Vol. 1. Belmont, MA: Athena Scientific.

Boyd and Vandenberghe (2004) Convex Optimization. Cambridge: Cambridge University Press.

Clever, Schemschat, Felis and Mombaur (2016) Inverse optimal control based identification of optimality criteria in whole-body human walking on level ground. In: IEEE International Conference on Biomedical Robotics and Biomechatronics, pp. 1192–1199.

Englert, Vien and Toussaint (2017) Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. The International Journal of Robotics Research 36: 1474–1488.

Jin, Kulić, Lin JFS, Mou and Hirche (2019) Inverse optimal control for multiphase cost functions. IEEE Transactions on Robotics 35(6): 1387–1398.

Jin, Murphey and Mou (2020) Learning from incremental directional corrections. arXiv preprint arXiv:2011.15014.

Johnson, Aghasadeghi and Bretl (2013) Inverse optimal control for deterministic continuous-time nonlinear systems. In: Conference on Decision and Control, pp. 2906–2913.

Kalman (1964) When is a linear control system optimal? Journal of Basic Engineering 86(1): 51–60.

Keshavarz, Wang and Boyd (2011) Imputing a convex objective function. In: IEEE International Symposium on Intelligent Control, pp. 613–619.

Kolter, Abbeel and Ng (2008) Hierarchical apprenticeship learning with application to quadruped locomotion. In: Advances in Neural Information Processing Systems, pp. 769–776.

Kuderer, Gulati and Burgard (2015) Learning driving styles for autonomous vehicles from demonstration. In: IEEE International Conference on Robotics and Automation, pp. 2641–2646.

Mainprice, Hayne and Berenson (2015) Predicting human reaching motion in collaborative tasks using inverse optimal control and iterative re-planning. In: IEEE International Conference on Robotics and Automation, pp. 885–892.

Mainprice, Hayne and Berenson (2016) Goal set inverse optimal control and iterative replanning for predicting human reaching motions in shared workspaces. IEEE Transactions on Robotics 32(4): 897–908.

Molloy, Ford and Perez (2018) Finite-horizon inverse optimal control for discrete-time nonlinear systems. Automatica 87: 442–446.

Molloy, Tsai, Ford and Perez (2016) Discrete-time inverse optimal control with partial-state information: A soft-optimality approach with constrained state estimation. In: IEEE Conference on Decision and Control, pp. 1926–1932.

Mombaur, Truong and Laumond (2010) From human to humanoid locomotion - an inverse optimal control approach. Autonomous Robots 28(3): 369–383.

Ng and Russell (2000) Algorithms for inverse reinforcement learning. In: International Conference on Machine Learning, pp. 663–670.

Papadopoulos, Bascetta and Ferretti (2016) Generation of human walking paths. Autonomous Robots 40(1): 59–75.

Park and Levine (2013) Inverse optimal control for humanoid locomotion. In: Robotics: Science and Systems.

Pérez-D’Arpino and Shah (2015) Fast target prediction of human reaching motion for cooperative human–robot manipulation tasks using time series classification. In: IEEE International Conference on Robotics and Automation. IEEE, pp. 6175–6182.

Pontryagin, Boltyansky, Gamkrelidze and Mischenko (1962) The Mathematical Theory of Optimal Processes. Chichester: Interscience.

Puydupin-Jamin, Johnson and Bretl (2012) A convex approach to inverse optimal control and its application to modeling human locomotion. In: IEEE International Conference on Robotics and Automation, pp. 531–536.

Ratliff, Bagnell and Zinkevich (2006) Maximum margin planning. In: International Conference on Machine Learning, pp. 729–736.

Spong and Vidyasagar (2008) Robot Dynamics and Control. New York: John Wiley & Sons.

Vasquez, Okal and Arras (2014) Inverse reinforcement learning algorithms and features for robot navigation in crowds: An experimental comparison. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1341–1346.

Ziebart, Maas, Bagnell and Dey (2008) Maximum entropy inverse reinforcement learning. In: AAAI Conference on Artificial Intelligence, pp. 1433–1438.