Tag: convex-optimization

  • Solving a system of equations vs inverting a matrix

    I used to have trouble understanding why inverting an \(n \times n\) matrix required the same order of operations as solving an \(n \times n\) system of linear equations. Specifically, if \(A\) is an \(n\times n\) invertible matrix and \(b\) is a length \(n\) vector, then computing \(A^{-1}\) and solving the equation \(Ax = b\) can both be done in \(O(n^3)\) floating point operations (flops). This surprised me because naively computing the columns of \(A^{-1}\) requires solving the \(n\) systems of equations

    \(Ax = e_1, Ax=e_2,\ldots, Ax = e_n,\)

    where \(e_1,e_2,\ldots, e_n\) are the standard basis vectors. I thought this would mean that calculating \(A^{-1}\) would require roughly \(n\) times as many flops as solving a single system of equations. It was only in a convex optimization lecture that I realized what was going on.

    To solve a single system of equations \(Ax = b\), we first compute a factorization of \(A\) such as the LU factorization. Computing this factorization takes \(O(n^3)\) flops, but once we have it, we can use it to solve any new system with \(O(n^2)\) flops. Now to solve \(Ax=e_1,Ax=e_2,\ldots,Ax=e_n\), we can simply factor \(A\) once and then perform \(n\) solves using the factorization, each of which requires an additional \(O(n^2)\) flops. The total flop count is then \(O(n^3)+nO(n^2)=O(n^3)\). Inverting the matrix requires the same order of flops as a single solve!
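
    Here is a minimal sketch of this idea, assuming NumPy and SciPy are available (the variable names and sizes are just for illustration). It factors \(A\) once with scipy.linalg.lu_factor, reuses the factorization to solve against each standard basis vector, and checks the stacked solutions against numpy.linalg.inv.

    ```python
    # Build A^{-1} by factoring A once (O(n^3)) and then doing n cheap solves
    # (O(n^2) each) against the standard basis vectors e_1, ..., e_n.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))

    lu, piv = lu_factor(A)  # one O(n^3) LU factorization of A

    # Each solve reuses the factorization; the solution of A x = e_i is column i of A^{-1}.
    A_inv = np.column_stack([lu_solve((lu, piv), e) for e in np.eye(n)])

    print(np.allclose(A_inv, np.linalg.inv(A)))  # True
    ```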

    Of course, as John D Cook points out: you shouldn’t ever invert a matrix. Even if inverting and solving take the same order of flops, inverting is still more computationally expensive and requires more memory.
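
    As a quick illustration of that advice, here is a sketch (again assuming NumPy; the size and seed are arbitrary) comparing a direct solve with forming the inverse explicitly. Both cost \(O(n^3)\) flops, but the direct solve typically gives a smaller residual.

    ```python
    # Compare solving A x = b directly against forming A^{-1} and multiplying.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    x_solve = np.linalg.solve(A, b)  # factor-and-solve, no explicit inverse
    x_inv = np.linalg.inv(A) @ b     # explicit inverse, then a matrix-vector product

    print(np.linalg.norm(A @ x_solve - b))  # residual of the direct solve
    print(np.linalg.norm(A @ x_inv - b))    # residual via the explicit inverse
    ```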


  • Concavity of the squared sum of square roots

    The \(p\)-norm of a vector \(x\in \mathbb{R}^n\) is defined to be:

    \(\displaystyle{\Vert x \Vert_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}}.\)

    If \(p \ge 1\), then the \(p\)-norm is convex. When \(0<p<1\), this function is not convex; in fact, it is concave when all the entries of \(x\) are non-negative. On a recent exam for the course Convex Optimization, we were asked to prove this when \(p = 1/2\). In this special case, the mathematics simplifies nicely.

    When \(p = 1/2\), the \(p\)-norm can be described as the squared sum of square roots. Specifically,

    \(\displaystyle{\Vert x \Vert_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|} \right)^2}.\)

    Note that we can expand the square and rewrite the function as follows

    \(\displaystyle{\Vert x \Vert_{1/2} = \sum_{i=1}^n\sum_{k=1}^n\sqrt{\left|x_i\right|} \sqrt{|x_k|} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{|x_ix_k|}}.\)

    If we restrict to \(x \in \mathbb{R}^n\) with \(x_i \ge 0\) for all \(i\), then this function simplifies to

    \(\displaystyle{\Vert x \Vert_{1/2} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{x_ix_k}},\)

    which is a sum of geometric means. The geometric mean function \((u,v) \mapsto \sqrt{uv}\) is concave when \(u,v \ge 0\). This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.
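
    Here is a small numerical check of that claim, assuming NumPy is available; the closed-form second derivatives in the sketch come from differentiating \(\sqrt{uv}\) directly, and the sampled range for \(u,v\) is arbitrary.

    ```python
    # Check that the Hessian of (u, v) -> sqrt(u * v) is negative semi-definite
    # at random positive points.
    import numpy as np

    def geomean_hessian(u, v):
        """Hessian of f(u, v) = sqrt(u * v) for u, v > 0."""
        return 0.25 * np.array([
            [-np.sqrt(v) * u ** -1.5,  (u * v) ** -0.5],
            [ (u * v) ** -0.5,        -np.sqrt(u) * v ** -1.5],
        ])

    rng = np.random.default_rng(0)
    for _ in range(1000):
        u, v = rng.uniform(0.1, 10.0, size=2)
        eigvals = np.linalg.eigvalsh(geomean_hessian(u, v))
        assert np.all(eigvals <= 1e-12)  # all eigenvalues are (numerically) non-positive

    print("Hessian was negative semi-definite at all sampled points")
    ```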

    From this, we can conclude that each function \(x \mapsto \sqrt{x_ix_k}\) is also concave. This is because \(x \mapsto \sqrt{x_ix_k}\) is a linear function followed by a concave function. Finally, any sum of concave functions is also concave and thus \(\Vert x \Vert_{1/2}\) is concave.
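
    To sanity-check the conclusion numerically, here is a randomized midpoint-concavity test (assuming NumPy; norm_half is just an illustrative helper name): for non-negative \(x\) and \(y\), a concave function satisfies \(f((x+y)/2) \ge (f(x)+f(y))/2\).

    ```python
    # Randomized midpoint-concavity check for the squared sum of square roots
    # on the non-negative orthant.
    import numpy as np

    def norm_half(x):
        """The 'p-norm' with p = 1/2: the squared sum of square roots."""
        return np.sum(np.sqrt(x)) ** 2

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, y = rng.random((2, 8))  # two random non-negative vectors
        lhs = norm_half((x + y) / 2)
        rhs = (norm_half(x) + norm_half(y)) / 2
        assert lhs >= rhs - 1e-12

    print("Midpoint concavity held on all sampled pairs")
    ```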

    Hellinger distance

    A similar argument can be used to show that the Hellinger distance is a convex function. The Hellinger distance, \(d_H(\cdot,\cdot)\), is defined on pairs of probability distributions \(p\) and \(q\) on a common set \(\{1,\ldots,n\}\). For such a pair,

    \(\displaystyle{d_H(p,q) = \sum_{i=1}^n \left(\sqrt{p_i}-\sqrt{q_i}\right)^2},\)

    which certainly doesn’t look convex. However, we can expand the square and use the fact that \(p\) and \(q\) are probability distributions, so that \(\sum_{i=1}^n p_i = \sum_{i=1}^n q_i = 1\). This shows that the Hellinger distance can be written as

    \(\displaystyle{d_H(p,q) = \sum_{i=1}^n p_i - 2\sum_{i=1}^n \sqrt{p_iq_i}+\sum_{i=1}^n q_i = 2 - 2\sum_{i=1}^n\sqrt{p_iq_i}}.\)

    Again, each function \((p,q) \mapsto \sqrt{p_iq_i}\) is concave, so the negative sum of such functions is convex, and adding the constant \(2\) does not change this. Thus, \(d_H(p,q)\) is convex.
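
    Here is the same kind of randomized midpoint check for this convexity claim, assuming NumPy; the helper names and the distribution size are arbitrary. Note that the midpoint of two probability distributions is again a probability distribution, so the test stays inside the domain.

    ```python
    # Randomized midpoint-convexity check for d_H, jointly in the pair (p, q).
    import numpy as np

    def hellinger(p, q):
        return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

    def random_distribution(rng, n):
        w = rng.random(n)
        return w / w.sum()

    rng = np.random.default_rng(0)
    n = 6
    for _ in range(1000):
        p1, q1 = random_distribution(rng, n), random_distribution(rng, n)
        p2, q2 = random_distribution(rng, n), random_distribution(rng, n)
        lhs = hellinger((p1 + p2) / 2, (q1 + q2) / 2)
        rhs = (hellinger(p1, q1) + hellinger(p2, q2)) / 2
        assert lhs <= rhs + 1e-12

    print("Midpoint convexity held on all sampled pairs")
    ```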

    The course

    As a final comment, I’d just like to say how much I am enjoying the course. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications. I’ve recently been reading a bit of John D Cook’s blog and agree with everything he says about the course here.