Iterative Linear Quadratic Regulator Design for Nonlinear Quadrotor System

The main difference between iLQR and LQR can be thought of in this way: LQR deals with optimal control, which provides an optimal solution over the entire state space. That is, once a goal is specified, you can start from anywhere and reach the target in an optimal way, as defined by the cost function you design. Optimality here means that taking any other path would incur more cost. iLQR, on the other hand, deals with trajectory optimization, which can be viewed as a variant of optimal control in the sense that we don't work over the entire state space but over a time horizon instead. In this case, we are given both the starting point and the end point, and the goal is to design a path, or trajectory, that minimizes the cost function we design while respecting the dynamics of the system.

We are now considering a finite horizon, which makes sense because we would like the vehicle to finish the trajectory over a finite period. We can transcribe the problem into an optimization program:

\[\begin{aligned} &\min_{x, u} \ l_f(x_N) + \sum_{k=0}^{N-1} l(x_k, u_k)\\ &\text{s.t. } \ x_{k+1} = f(x_k, u_k) \end{aligned}\]

The neat thing about the expression above is that although it appears to have a dynamics constraint, it can be treated as an unconstrained optimization problem by simulating forward with the input trajectory. Note that the dynamics above are written as nonlinear discrete-time dynamics; the underlying equations of motion are continuous in time, and we will discretize the time horizon later.
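To make the objective concrete, here is a minimal Python sketch of evaluating it along a candidate trajectory; the helper names (`total_cost`, `l`, `l_f`) are my own, not from any particular library:

```python
def total_cost(x_traj, u_traj, l, l_f):
    """Objective of the transcribed program: l_f(x_N) + sum_k l(x_k, u_k).

    x_traj holds N+1 states, u_traj holds N inputs; l and l_f are the
    stage and terminal cost functions.
    """
    return l_f(x_traj[-1]) + sum(l(x, u) for x, u in zip(x_traj[:-1], u_traj))
```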

The "iterative" part refers to the following idea: we first guess an input trajectory, and then iteratively solve for the optimal trajectory by simulating forward and solving an LQR-like problem backwards. Say we have an initial guess of the input trajectory \(\{ \hat{u}_k \}\); we can choose an integration scheme to obtain the corresponding state trajectory \(\{ \hat{x}_k \}\).
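A minimal sketch of this forward pass, assuming a continuous-time dynamics function `f(x, u)` and picking RK4 as the integration scheme (any reasonable scheme would do):

```python
import numpy as np

def rollout(f, x0, u_traj, dt):
    """Integrate the continuous dynamics x_dot = f(x, u) forward from x0,
    holding each input constant over a step of length dt (RK4 scheme)."""
    x_traj = [np.asarray(x0, dtype=float)]
    for u in u_traj:
        x = x_traj[-1]
        k1 = f(x, u)
        k2 = f(x + 0.5 * dt * k1, u)
        k3 = f(x + 0.5 * dt * k2, u)
        k4 = f(x + dt * k3, u)
        x_traj.append(x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(x_traj)
```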

We first linearize the nonlinear system about this trajectory. Unlike LQR, where we approximate at a provided desired trajectory \(\{ x^d, u^d\}\), in iLQR we approximate at the current guess. Note that we assumed the system is discrete, so it is natural to discretize the system first if it is not. We then approximate the nonlinear discrete system via a Taylor series up to first order. The proper way is to first linearize the nonlinear continuous system derived from the equations of motion and then discretize the result given the time step:

\[\begin{aligned} x_{k+1} & = f(x_k, u_k)\\ &\approx f(\hat{x}_k, \hat{u}_k) + \nabla_x f(\hat{x}_k, \hat{u}_k)^T(x_k - \hat{x}_k) + \nabla_u f(\hat{x}_k, \hat{u}_k)^T(u_k-\hat{u}_k)\\ & = \hat{x}_{k+1} + A_k \delta x_k + B_k \delta u_k \end{aligned}\]

We rearrange the equation to express the first-order approximation of the dynamics in error coordinates:

\[\delta x_{k+1} = A_k \delta x_k + B_k \delta u_k\]
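One practical way to obtain \(A_k\) and \(B_k\) is by finite differences on the discretized dynamics (analytic Jacobians or automatic differentiation are the usual alternatives); a sketch, assuming a discrete step function `f_d(x, u)` such as one RK4 step from the rollout above:

```python
import numpy as np

def linearize(f_d, x_hat, u_hat, eps=1e-6):
    """Finite-difference Jacobians A = df_d/dx, B = df_d/du of the discrete
    dynamics about the current guess (x_hat, u_hat)."""
    n, m = x_hat.size, u_hat.size
    f0 = f_d(x_hat, u_hat)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        A[:, i] = (f_d(x_hat + dx, u_hat) - f0) / eps
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        B[:, j] = (f_d(x_hat, u_hat + du) - f0) / eps
    return A, B
```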

Note that after discretization, the original continuous-time system is now a discrete-time linear system in the error coordinates. iLQR also deviates from LQR in that neither the local cost nor the cost-to-go is assumed to have any particular form, so we have to approximate both up to second order using a Taylor series, which gives costs containing both linear and quadratic terms. Just as we did previously when approximating the dynamics at the trajectory, we do the same here for the cost, starting with the cost-to-go:

\[\begin{aligned} V(x_k) &\approx V(\hat{x}_k) + g_k^T(x_k - \hat{x}_k) + \frac{1}{2} (x_k - \hat{x}_k)^T H_k (x_k - \hat{x}_k)\\ &= V(\hat{x}_k) + g_k^T\delta x_k + \frac{1}{2} \delta x_k^T H_k\delta x_k\\ \end{aligned}\]

The nonlinear local cost can be approximated as follows:

\[\begin{aligned} l(x_k, u_k) \approx &l(\hat{x}_k, \hat{u}_k) + l_x^T\delta x_k + l_u^T\delta u_k + \frac{1}{2} \delta x_k^T l_{xx} \delta x_k + \frac{1}{2} \delta u_k^T l_{uu} \delta u_k + \\ &\frac{1}{2} \delta x_k^T l_{xu} \delta u_k + \frac{1}{2} \delta u_k^T l_{ux} \delta x_k \end{aligned}\]
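As a concrete example, for the common quadratic stage cost \(l(x,u) = \frac{1}{2}(x - x^d)^T Q (x - x^d) + \frac{1}{2} u^T R u\) the expansion terms come out in closed form, and the cross terms vanish; a sketch with assumed names:

```python
import numpy as np

def cost_expansion(x_hat, u_hat, x_des, Q, R):
    """Gradient and Hessian blocks of a quadratic stage cost evaluated at
    the current guess; for this cost the second-order expansion is exact."""
    l_x = Q @ (x_hat - x_des)
    l_u = R @ u_hat
    l_xx = Q
    l_uu = R
    l_xu = np.zeros((x_hat.size, u_hat.size))  # no state-input coupling here
    return l_x, l_u, l_xx, l_uu, l_xu
```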

We now revisit the famous Bellman equation and replace all the terms with our approximations:

\[\begin{aligned} V(x_k)= &\min_{u} l(x_k, u_k) + V(f(x_k, u_k))\\ = &\min_{u} l(\hat{x}_k, \hat{u}_k) + l_x^T\delta x_k + l_u^T\delta u_k + \frac{1}{2} \delta x_k^T l_{xx} \delta x_k + \frac{1}{2} \delta u_k^T l_{uu} \delta u_k \\ & + \frac{1}{2} \delta x_k^T l_{xu} \delta u_k + \frac{1}{2} \delta u_k^T l_{ux} \delta x_k + V(\hat{x}_{k+1}) + g_{k+1}^T\delta x_{k+1} + \frac{1}{2} \delta x_{k+1}^TH_{k+1}\delta x_{k+1}\\ = & \min_{u} l(\hat{x}_k, \hat{u}_k) + V(\hat{x}_{k+1}) + l_x^T\delta x_k + l_u^T\delta u_k + \frac{1}{2} \delta x_k^T l_{xx} \delta x_k + \frac{1}{2} \delta u_k^T l_{uu} \delta u_k\\ & + \frac{1}{2} \delta x_k^T l_{xu} \delta u_k + \frac{1}{2} \delta u_k^T l_{ux} \delta x_k + g_{k+1}^T( A_k \delta x_k + B_k \delta u_k) \\ & + \frac{1}{2} (A_k \delta x_k + B_k \delta u_k)^T H_{k+1} ( A_k \delta x_k + B_k \delta u_k) \end{aligned}\]

Looks scary? But we can also observe that this expression is quadratic in both \(\delta x_k\) and \(\delta u_k\), which means we can rearrange the equation as follows:

\[\delta Q_k = \min _{\delta u} \ \frac{1}{2} \begin{bmatrix} \delta x_k\\ \delta u_k\end{bmatrix}^T Q_k \begin{bmatrix} \delta x_k\\ \delta u_k\end{bmatrix} + q_k^T \begin{bmatrix} \delta x_k\\ \delta u_k\end{bmatrix}\]

where

\[Q_k = \begin{bmatrix} A_k^T H_{k+1}A_k + l_{xx} & A^T_k H_{k+1} B_k + l_{xu}\\ B_k^T H_{k+1}A_k + l_{ux} & B^T_k H_{k+1} B_k + l_{uu}\\ \end{bmatrix} \quad q_k = \begin{bmatrix} A_k^T g_{k+1} + l_x \\ B_k^T g_{k+1} + l_u \end{bmatrix}\]

If we now take the derivative of the expression above with respect to \(\delta u_k\) and set it to zero, we see that terms quadratic in \(\delta x_k\) drop out and we end up with:

\[0 = Q_{uu} \delta u_k^* + \frac{1}{2} (Q_{xu}^T \delta x_k + Q_{ux}\delta x_k) + q_u = Q_{uu} \delta u_k^* + Q_{ux}\delta x_k + q_u\]

where the last step uses \(Q_{xu}^T = Q_{ux}\).

We rearrange the equation to solve for the optimal control correction:

\[\begin{aligned} \delta u_k ^* &= -Q_{uu}^{-1} (Q_{ux}\delta x_k + q_u)\\ &= -Q_{uu}^{-1} Q_{ux} \delta x_k - Q_{uu}^{-1}q_u \\ &= K_k \delta x_k + d_k \end{aligned}\]

This is again, just as in LQR, a linear feedback controller, now with an additive feedforward term. We can observe that each element here is evaluated at the current time step \(k\), EXCEPT the Hessian matrix and gradient vector of the approximated cost-to-go, which are evaluated at the following time step \(k+1\). This is exactly the reason we run iLQR backwards, in the opposite direction of the forward simulation. Once we obtain the gain and the additive term at the current time step \(k\), we can express the Hessian and gradient needed at the preceding time step \(k-1\).
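Putting the pieces together, a single backward step might look like the sketch below, reusing the expansion helpers from earlier. Note that we solve with \(Q_{uu}\) rather than forming its inverse explicitly; in practice a regularization term is often added to keep \(Q_{uu}\) positive definite, a detail omitted here:

```python
import numpy as np

def backward_step(A, B, l_x, l_u, l_xx, l_uu, l_xu, g_next, H_next):
    """Assemble the blocks of Q_k and q_k, then compute the feedback gain
    K_k and feedforward term d_k for one step of the backward pass."""
    Q_xx = l_xx + A.T @ H_next @ A
    Q_uu = l_uu + B.T @ H_next @ B
    Q_ux = l_xu.T + B.T @ H_next @ A   # l_ux = l_xu^T
    q_x = l_x + A.T @ g_next
    q_u = l_u + B.T @ g_next
    K = -np.linalg.solve(Q_uu, Q_ux)   # K_k = -Q_uu^{-1} Q_ux
    d = -np.linalg.solve(Q_uu, q_u)    # d_k = -Q_uu^{-1} q_u
    return K, d, Q_xx, Q_uu, Q_ux, q_x, q_u
```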

Just as with HJB, where we plugged the optimal control input back into the equation to obtain the algebraic Riccati equation and solve for the matrix \(S\), we are going to do the same here, but to obtain expressions for the Hessian and gradient.

\[\begin{aligned} \delta Q_k = & \frac{1}{2} \delta x_k^T Q_{xx} \delta x_k + \frac{1}{2} (K_k \delta x_k + d_k)^T Q_{uu} (K_k \delta x_k + d_k) \\ & + \frac{1}{2} \delta x_k^T Q_{xu} (K_k \delta x_k + d_k) + \frac{1}{2}(K_k \delta x_k + d_k)^T Q_{ux} \delta x_k \\ & + q_x^T \delta x_k + q_u^T (K_k \delta x_k + d_k) \\ = & \frac{1}{2} \delta x_k^T (Q_{xx}+ K_k^T Q_{uu} K_k + 2Q_{xu}K_k ) \delta x_k\\ & + (d_k^T Q_{uu}K_k + d_k^T Q_{ux} + q_x^T + q_u^T K_k) \delta x_k +\frac{1}{2} d_k^T Q_{uu} d_k + q_u^T d_k \end{aligned}\]

The idea is to recall that in the beginning we approximated the cost-to-go up to second order, which is a constant term plus a term linear in \(\delta x\) and a term quadratic in \(\delta x\). Hence, once we plug in the optimal control, we should expect the same structure.

\[\begin{gathered} \frac{1}{2} \delta x_k^T H_{k} \delta x_k = \frac{1}{2} \delta x_k^T (Q_{xx} + Q_{xu}Q_{uu}^{-1}Q_{uu}Q_{uu}^{-1}Q_{ux} - 2Q_{xu}Q_{uu}^{-1}Q_{ux}) \delta x_k\\ g_{k}^T \delta x_k = (q_u^T Q_{uu}^{-1}Q_{uu}Q_{uu}^{-1}Q_{ux} - q_u^T Q_{uu}^{-1}Q_{ux} + q_x^T - q_u^T Q_{uu}^{-1}Q_{ux}) \delta x_k \end{gathered}\]

Hence,

\[\begin{gathered} H_k = Q_{xx} -Q_{xu}Q_{uu}^{-1}Q_{ux} \\ g_k = q_x -Q_{xu}Q_{uu}^{-1}q_u \end{gathered}\]

We can avoid inverting \(Q_{uu}\) again by inserting the identity \(Q_{uu}^{-1}Q_{uu}\) into both expressions above, which lets us rewrite them in terms of \(K_k\) and \(d_k\):

\[\begin{gathered} H_k = Q_{xx} -Q_{xu}Q_{uu}^{-1}Q_{uu}Q_{uu}^{-1}Q_{ux} = Q_{xx} - K_k^T Q_{uu} K_k\\ g_k = q_x -Q_{xu}Q_{uu}^{-1}Q_{uu}Q_{uu}^{-1}q_u = q_x - K_k^T Q_{uu} d_k \end{gathered}\]
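In code, continuing the `backward_step` sketch from above (same assumed names), the cost-to-go update is then just two lines:

```python
# Value-function update at step k, reusing the gains and Q blocks
# returned by backward_step:
H = Q_xx - K.T @ Q_uu @ K   # Hessian of the approximated cost-to-go, H_k
g = q_x - K.T @ Q_uu @ d    # gradient of the approximated cost-to-go, g_k
```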

Let’s now wrap up the complete backward pass. From above, we conclude that \(K_k\) and \(d_k\) at the current time step \(k\) require \(H_{k+1}\) and \(g_{k+1}\) from the next time step \(k+1\), and once we have \(K_k\) and \(d_k\), we can compute \(H_{k}\) and \(g_{k}\) at the current time step \(k\). Starting at the final time step, we know that the cost-to-go is simply \(V(x_N) = l_f(x_N)\), so the gradient and Hessian are easy to compute. Regardless of whether \(l_f(\cdot)\) is linear or nonlinear, we can always compute its gradient and Hessian with respect to the state. For example, if it is quadratic:

\[l_f(x_N) = \frac{1}{2} (x_N-x_N^d)^T Q_f (x_N-x_N^d)\]

We can easily compute the gradient and Hessian:

\[g_N = Q_f (x_N-x_N^d) \quad H_N = Q_f\]

We can then propagate backwards to obtain the gain at each preceding time step by computing the remaining terms of the matrix \(Q_k\) and plugging them in. At the end, we have a trajectory of gains and additive terms from \(N-1\) down to 0. We compute the feedback control, add it to the previous guess to generate a new input trajectory, simulate forward again to reach a new state trajectory, and perform the backward pass again until the cost converges.
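A sketch of the full loop, stitching together the hypothetical helpers from above (`linearize`, `cost_expansion`, `backward_step`, `total_cost`) plus a small discrete rollout. A line search on the feedforward term is commonly added for robustness but is omitted here for brevity:

```python
import numpy as np

def rollout_discrete(f_d, x0, u_traj):
    """Simulate the discrete dynamics forward under the input trajectory."""
    xs = [np.asarray(x0, dtype=float)]
    for u in u_traj:
        xs.append(f_d(xs[-1], u))
    return np.array(xs)

def ilqr(f_d, x0, u_traj, x_des, Q, R, Q_f, n_iters=50, tol=1e-6):
    """Alternate forward rollouts and backward passes until convergence.
    f_d(x, u) is the discrete dynamics; u_traj is the initial input guess."""
    u_traj = [np.asarray(u, dtype=float) for u in u_traj]
    N = len(u_traj)
    x_traj = rollout_discrete(f_d, x0, u_traj)
    prev_cost = np.inf
    for _ in range(n_iters):
        # Backward pass: initialize with the terminal cost expansion.
        g = Q_f @ (x_traj[-1] - x_des)   # g_N
        H = Q_f                          # H_N
        Ks, ds = [None] * N, [None] * N
        for k in reversed(range(N)):
            A, B = linearize(f_d, x_traj[k], u_traj[k])
            l_x, l_u, l_xx, l_uu, l_xu = cost_expansion(
                x_traj[k], u_traj[k], x_des, Q, R)
            K, d, Q_xx, Q_uu, _, q_x, q_u = backward_step(
                A, B, l_x, l_u, l_xx, l_uu, l_xu, g, H)
            H = Q_xx - K.T @ Q_uu @ K
            g = q_x - K.T @ Q_uu @ d
            Ks[k], ds[k] = K, d
        # Forward pass: apply the corrections while re-simulating.
        x_new, u_new = [x_traj[0]], []
        for k in range(N):
            du = Ks[k] @ (x_new[k] - x_traj[k]) + ds[k]
            u_new.append(u_traj[k] + du)
            x_new.append(f_d(x_new[k], u_new[k]))
        x_traj, u_traj = np.array(x_new), u_new
        cost = total_cost(
            x_traj, u_traj,
            lambda x, u: 0.5 * (x - x_des) @ Q @ (x - x_des) + 0.5 * u @ R @ u,
            lambda x: 0.5 * (x - x_des) @ Q_f @ (x - x_des))
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return x_traj, u_traj, Ks, ds
```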

We now illustrate an implementation on a 2D quadrotor. The video shows that even though the initial guess was off, the algorithm was still able to iteratively correct the path and find the optimal trajectory.
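For reference, a minimal sketch of planar quadrotor dynamics one might use for `f(x, u)` above; the state is \([p_x, p_z, \theta, v_x, v_z, \omega]\), the inputs are the two rotor thrusts, and the parameter values are illustrative, not those of the original implementation:

```python
import numpy as np

def planar_quadrotor(x, u, m=1.0, J=0.01, arm=0.15, grav=9.81):
    """Continuous-time 2D quadrotor dynamics x_dot = f(x, u).
    arm is the distance from the center of mass to each rotor."""
    _, _, th, vx, vz, om = x
    u1, u2 = u
    thrust = u1 + u2
    ax = -thrust * np.sin(th) / m          # horizontal acceleration
    az = thrust * np.cos(th) / m - grav    # vertical acceleration
    alpha = arm * (u1 - u2) / J            # angular acceleration
    return np.array([vx, vz, om, ax, az, alpha])
```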

Another neat thing about iLQR is that it automatically provides an LQR-style controller as well: the gains from the solution give us a time-varying linear feedback controller about the optimal trajectory.

We can see how iLQR gives us a linear feedback controller that allows the quadrotor to start from a different starting point yet still converge to the planned trajectory and reach the target location.
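A sketch of that use, with hypothetical names: given the converged reference trajectory and gains from the `ilqr` sketch above, we track from a perturbed start by feeding back the state error at every step.

```python
import numpy as np

def track(f_d, x_ref, u_ref, Ks, x0):
    """Run the time-varying feedback law u_k = u_k_ref + K_k (x_k - x_k_ref)
    from a (possibly perturbed) initial state x0."""
    xs = [np.asarray(x0, dtype=float)]
    for k in range(len(u_ref)):
        u = u_ref[k] + Ks[k] @ (xs[-1] - x_ref[k])
        xs.append(f_d(xs[-1], u))
    return np.array(xs)

# e.g. start half a meter off the planned initial position (hypothetical offset):
# xs = track(f_d, x_ref, u_ref, Ks, x_ref[0] + np.array([0.5, 0, 0, 0, 0, 0]))
```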