Variational refinement -- an introduction and derivation

The variational refinement is probably the most difficult part of the
algorithm to understand, in part because the descriptions in most papers
are very heavy on notation and rather light on exposition. I've tried to
give a somewhat friendlier introduction to this specific algorithm below.

The general idea is fairly simple; we try to optimize the flow field as a
whole, by minimizing some mathematical notion of badness expressed as an
energy function. The one used in the dense inverse search paper
[Kroeger16; see references below] has this form:

  E(U) = int( σ Ψ(E_I) + γ Ψ(E_G) + α Ψ(E_S) ) dx

where Ψ(a²) = sqrt(a² + ε²) for some small constant ε = 0.001, and σ, γ, α
are empirically set weighting constants. (We'll get to what the different
energy terms are in a minute.) The integral is, for all practical purposes,
just a sum over all the pixels in the flow.

In general, such formulas are nonconvex and highly nonlinear, so we cannot
hope to find a global minimum -- but if we start from the flow generated by
the motion search, we can at least hope to make it somehow better by
walking towards a local minimum. (In fact, there are many methods for
optical flow that work _only_ by such minimization, so the word
“refinement” is maybe not doing the method justice. One could just as well
say that the motion search is a way of finding a reasonable starting point
for the optimization.)

The dense inverse search paper [Kroeger16] sets up the energy terms as
described by some motion tensors and normalizations, then says simply that
it is optimized by “θ_vo fixed point iterations and θ_vi iterations of
Successive Over Relaxation (SOR) for the linear system”. It's not
immediately obvious what this means, but it gives a reference to [Brox04].
However, that paper describes a numerical approximation scheme that is
_far_ more complicated than what the DIS code actually does. Rather, one
must look at the other main reference they give, which is [Weinzaepfel13],
describing a system called DeepFlow. DIS borrows most of the exposition and
code for its variational refinement from DeepFlow, just removing some terms
and fixing up a few issues here and there. (There are some slight
differences in the paper, like the use of ∂z instead of ∂t, but that looks
mostly like an error to me.) Unfortunately, that paper in turn refers to
[Brox11], which is no more helpful in clearing up the notation. However,
digging down in the references, one finally finds [Zimmer11], which is
where the tensor notation appears to come from.

This allows us to look at the first term in the energy, E_I, which comes
from the intensity constancy assumption. The basic idea is that optical
flow should, nearly by definition, preserve intensity after the warp:

  I_0(x + u) = I_1(x)

where I_0 is the first picture, I_1 is the second, x is any 2D coordinate
and u is the flow at x (which we are optimizing over). In general, we'll be
optimizing over the entire field of u (potentially hundreds of thousands of
values), but we'll be looking mostly at individual points, so we'll skip
the coordinates when we can (e.g. we write u instead of u(x, y)). u is of
course the 2D flow, although often, we'll write its components separately
as u and v instead of as a vector u.

Before we go further, we need to add some more notation:

* I_x is the partial derivative of I with respect to x (at some point),
  and similarly for I_y. These do not depend on u, so they can be
  precalculated before the optimization.
* I_xx is the double partial derivative of I, and similarly for I_yy and
  I_xy (the latter is the same as I_yx).
* I_t is the temporal derivative of I, ie. in practice just
  I_t(x) = I_1(x) - I_0(x).

Returning now to our original assertion:

  I_0(x + u) = I_1(x)

Classically in optical flow, one assumes that the image is smooth and
locally linear around the point x, which allows one to approximate this
equation by

  I_x u + I_y v + I_t = 0

This is usually simply called “the optical flow constraint”, and gives rise
to a very natural part of the energy:

  E_I = I_x u + I_y v + I_t

Remember that we send E_I through the function Ψ(a²) = sqrt(a² + ε²), so
clearly Ψ(E_I) will be minimized if indeed E_I is zero.

At this point, many papers start talking about the multivariate
Euler-Lagrange equations, which is a fairly daunting concept (at least the
Wikipedia page is suitable for scaring small children). However, for the
first two terms, we don't need their general form, and everything reduces
to something much simpler; just differentiate the energy by u and equate
the result to zero (finding some minimum; it can't be a maximum, since
*wave hands intensely*). Then differentiate the energy by v and set that to
zero, too; now you have two equations in two unknowns (or, since we're
optimizing over a field, maybe 500k equations in 500k unknowns -- although
the equation set will be very sparse), which is hopefully solvable using
linear methods.

We'll look at what this gives for E_I in a moment, then try to apply the
same notions to E_G and E_S later. First we modify E_I a bit by adding some
normalization:

  E_I = β_0 (I_x u + I_y v + I_t)

where β_0 = 1/(abs(∇I)² + 0.01). Note that β_0 depends on I only, so for
the purposes of optimizing u, it's a constant and can be precomputed across
I. (β_0 will, of course, depend on x, but so do all the other terms in the
equation.)

Now we give it to Maple, differentiating first by u and then by v:

> M := (u,v) -> B_0 * (I_x * u + I_y * v + I_t);

  M := (u, v) -> B_0 (I_x u + I_y v + I_t)

> diff(sqrt(M(u,v)^2 + e), u);

  B_0^2 (I_x u + I_y v + I_t) I_x / sqrt(B_0^2 (I_x u + I_y v + I_t)^2 + e)

> diff(sqrt(M(u,v)^2 + e), v);

  B_0^2 (I_x u + I_y v + I_t) I_y / sqrt(B_0^2 (I_x u + I_y v + I_t)^2 + e)

So these are the two expressions to be set to zero (for each point). We'll
notice immediately that this isn't very linear in u and v, so here's where
the “fixed point iterations” come in; we simply assume that our previous
values for u and v are approximately good enough for the denominator, and
optimize them in the numerator only. Then we get new values that are
hopefully a bit closer, which we can then use for the denominator, and so
on. (This is seemingly an old technique; [Brox05] cites [Ciarlet78]. It is
justifiable in the sense that the only thing really held constant is the
derivative of the penalizer.) In other words, if we define the constant

  k1 = β_0² / sqrt(β_0² (I_x u' + I_y v' + I_t)² + ε²)

(where u' and v' are the guesses for u and v from the previous iteration),
we have the much more manageable

  k1 I_x²    u + k1 I_x I_y v = - k1 I_t I_x
  k1 I_x I_y u + k1 I_y²    v = - k1 I_t I_y

ie., two linear equations in u and v. Now, you will notice two immediate
problems with this equation set:

* The factor k1 is completely useless, since it's just multiplied in
  everywhere.
* The set of equations is colinear (the determinant of the matrix is zero),
  and thus there is an infinite number of possible solutions -- the
  so-called “aperture problem”.

It shouldn't be surprising, though, as we cannot expect a single constraint
to let us solve for two unknowns. However, both problems will go away as
soon as we start adding more terms, so let's look at the gradient constancy
term E_G next. It is fairly similar to the brightness constancy term,
except it uses the (spatial) gradient instead of the intensity:

  ∇I_0(x + u) = ∇I_1(x)

or equivalently (by definition):

  (∂I/∂x)_0(x + u) = (∂I/∂x)_1(x)
  (∂I/∂y)_0(x + u) = (∂I/∂y)_1(x)

The idea is that this is more robust to changes in lighting. It doesn't
replace the intensity term, but augments it; the weighting constants σ and
γ control their relative importance. Also note that this actually gives us
two independent equations, unlike the brightness constancy term.

However, it is not at all obvious how to discretize this. In particular,
most papers, including [Brox04], appear to want _not_ to make any linear
assumptions of the flow in this case, and end up with tons of terms. (The
DIS and DeepFlow papers do, again, use some tensor notation that I do not
understand, but I'm not convinced it actually contains any of the
discretization.) Yet more paper searching eventually turns up [Fahad07],
which simply states that the discretized versions of these equations are:

  I_xx u + I_xy v + I_xt = 0
  I_yx u + I_yy v + I_yt = 0

which matches well with what the DIS code uses. Note that even though these
equations are set equal to zero, we can't just solve them directly; we need
to make (penalized, normalized) energy terms out of them and add them to
the other terms. This gives

  E_G = β_x (I_xx u + I_xy v + I_xt) + β_y (I_yx u + I_yy v + I_yt)

with normalization terms

  β_x = 1 / (abs(∇(I_x))² + 0.01)    (∇(I_x) is the gradient of ∂I/∂x)
  β_y = 1 / (abs(∇(I_y))² + 0.01)

(The DIS paper writes ∇I_dx and ∇I_dy instead of ∇I_x and ∇I_y, but I
believe that's a typo; the DeepFlow paper says ∇I_x and ∇I_y.)

The papers both write that Ψ(E_G) is used, which would mean that the
penalized term is

  E_G = sqrt((β_x (I_xx u + I_xy v + I_xt) + β_y (I_yx u + I_yy v + I_yt))² + ε²)

but that isn't what the code actually does. Instead, it seems that the two
terms are squared independently:

  E_G = sqrt((β_x (I_xx u + I_xy v + I_xt))² + (β_y (I_yx u + I_yy v + I_yt))² + ε²)

Both are solvable just fine, and it probably does not matter all that much
which we use in practice (although [Zimmer11] suggests that if we are using
multichannel images, we should penalize the three channels separately), but
we follow what the code actually does here.
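As a small aside, here is a minimal numpy sketch of how the per-pixel
quantities introduced so far (the image derivatives and the normalization
factors β_0, β_x and β_y) could be precomputed. This is purely my own
illustration -- the names are made up, np.gradient is just a stand-in for
the actual derivative filters (see the discussion of stencils further
down), and the derivatives are taken on I_0 only for simplicity:

  import numpy as np

  def precompute_derivatives(I0, I1):
      # Spatial derivatives; axis 0 is y (rows), axis 1 is x (columns).
      I_y, I_x = np.gradient(I0)
      I_xy, I_xx = np.gradient(I_x)   # d(I_x)/dy, d(I_x)/dx
      I_yy, _    = np.gradient(I_y)   # d(I_y)/dy
      I_t = I1 - I0                   # temporal derivative

      # Normalization factors, with the 0.01 constant from the text.
      beta_0 = 1.0 / (I_x**2 + I_y**2 + 0.01)     # 1 / (|grad I|^2   + 0.01)
      beta_x = 1.0 / (I_xx**2 + I_xy**2 + 0.01)   # 1 / (|grad I_x|^2 + 0.01)
      beta_y = 1.0 / (I_xy**2 + I_yy**2 + 0.01)   # 1 / (|grad I_y|^2 + 0.01)
      return I_x, I_y, I_xx, I_xy, I_yy, I_t, beta_0, beta_x, beta_y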
We can differentiate them and equate them to zero as before:

> M_x := (u,v) -> B_x * (I_xx * u + I_xy * v + I_xt);

  M_x := (u, v) -> B_x (I_xx u + I_xy v + I_xt)

> M_y := (u,v) -> B_y * (I_xy * u + I_yy * v + I_yt);

  M_y := (u, v) -> B_y (I_xy u + I_yy v + I_yt)

> diff(sqrt(M_x(u,v)^2 + M_y(u,v)^2 + e), u);

  (B_x^2 (I_xx u + I_xy v + I_xt) I_xx + B_y^2 (I_xy u + I_yy v + I_yt) I_xy)
    / sqrt(B_x^2 (I_xx u + I_xy v + I_xt)^2 + B_y^2 (I_xy u + I_yy v + I_yt)^2 + e)

> diff(sqrt(M_x(u,v)^2 + M_y(u,v)^2 + e), v);

  (B_x^2 (I_xx u + I_xy v + I_xt) I_xy + B_y^2 (I_xy u + I_yy v + I_yt) I_yy)
    / sqrt(B_x^2 (I_xx u + I_xy v + I_xt)^2 + B_y^2 (I_xy u + I_yy v + I_yt)^2 + e)

Using the same fixed-point scheme where we hold the terms in the
denominator constant and equal to last iteration's values, we get a new
common constant

  k2 = 1 / sqrt(β_x² (I_xx u' + I_xy v' + I_xt)² + β_y² (I_xy u' + I_yy v' + I_yt)² + ε²)

and for brevity

  k_x = k2 β_x²
  k_y = k2 β_y²

and thus, collecting terms for u and v, we get the two equations:

  (k_x I_xx² + k_y I_xy²) u + (k_x I_xx I_xy + k_y I_xy I_yy) v = - k_x I_xx I_xt - k_y I_xy I_yt
  (k_x I_xx I_xy + k_y I_xy I_yy) u + (k_x I_xy² + k_y I_yy²) v = - k_x I_xy I_xt - k_y I_yy I_yt

which is linear in u and v, not colinear (unless we are extremely unlucky),
and can be easily solved. Of course, for optimizing the weighted sum
σ Ψ(E_I) + γ Ψ(E_G), we just add the two equation sets pairwise with
appropriate weights.

There's a small discrepancy here: The equations suggest that we should be
squaring the normalization terms β_0², β_x², β_y²; however, the code does
not appear to do so. It's possible that they were intended to be added
outside of the penalization, e.g. Ψ(a²) = sqrt(β a² + ε²), but given that
these come from [Zimmer11], which mentions nothing of the sort, I'll just
have to assume that this is an implementation mishap.
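To make the fixed-point linearization concrete, here is a rough numpy
sketch of assembling and solving the 2x2 data-term system at a single
pixel. It follows the equations as derived above (ie., with the squared
normalizations, which the actual code drops); the names, and the default
values for the weights sigma and gamma, are only illustrative placeholders:

  import numpy as np

  def data_term_system(I_x, I_y, I_t, I_xx, I_xy, I_yy, I_xt, I_yt,
                       beta_0, beta_x, beta_y, u_prev, v_prev,
                       sigma=1.0, gamma=1.0, eps=0.001):
      # Intensity constancy: k1 = beta_0^2 / sqrt(beta_0^2 (...)^2 + eps^2),
      # evaluated at the previous iteration's u and v.
      r_i = I_x * u_prev + I_y * v_prev + I_t
      k1 = beta_0**2 / np.sqrt(beta_0**2 * r_i**2 + eps**2)

      # Gradient constancy: k_x = k2 beta_x^2, k_y = k2 beta_y^2.
      r_x = I_xx * u_prev + I_xy * v_prev + I_xt
      r_y = I_xy * u_prev + I_yy * v_prev + I_yt
      k2 = 1.0 / np.sqrt(beta_x**2 * r_x**2 + beta_y**2 * r_y**2 + eps**2)
      k_x, k_y = k2 * beta_x**2, k2 * beta_y**2

      # sigma * (E_I system) + gamma * (E_G system), as A [u v]^T = b.
      A = sigma * k1 * np.array([[I_x * I_x, I_x * I_y],
                                 [I_x * I_y, I_y * I_y]]) + \
          gamma * np.array([[k_x * I_xx**2 + k_y * I_xy**2,
                             k_x * I_xx * I_xy + k_y * I_xy * I_yy],
                            [k_x * I_xx * I_xy + k_y * I_xy * I_yy,
                             k_x * I_xy**2 + k_y * I_yy**2]])
      b = -sigma * k1 * np.array([I_t * I_x, I_t * I_y]) - \
          gamma * np.array([k_x * I_xx * I_xt + k_y * I_xy * I_yt,
                            k_x * I_xy * I_xt + k_y * I_yy * I_yt])
      return A, b   # for the data terms alone: u, v = np.linalg.solve(A, b)

Once the smoothness term described next is brought in, this 2x2 structure
survives; the smoothness contributions only add to the diagonal and to the
right-hand side.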
The final smoothness term is the one that binds the flow field together as
a whole, so that we don't have WxH completely independent equations (with
its positive and negative sides, of course). It is the simplest in terms of
notation, but it requires the full power of the Euler-Lagrange equations to
minimize, so we'll need to figure that part out.

  E_S = abs(∇u)² + abs(∇v)²

or

  E_S = (u_x² + u_y²) + (v_x² + v_y²)

The penalized form used by DeepFlow, contrary to what you'd expect from the
paper, is:

  E_S = sqrt(u_x² + u_y² + v_x² + v_y² + ε²)

How would one go about minimizing such an expression with respect to u?
(We'll deal with v later.) It's perhaps no big surprise that the expression
involves double derivatives, but the full form involves the Euler-Lagrange
equations. They allow us to minimize expressions that contain x, y, u(x, y)
_and_ the partial derivatives u_x(x, y) and u_y(x, y), although the answer
becomes a differential equation. The Wikipedia page is, unfortunately, not
very beginner-friendly, but the general idea is: Differentiate the
expression by u_x (yes, differentiating by a partial derivative!), negate
it, and then differentiate the result by x. Then do the same thing by u_y
and y, add the two results together and equate to zero. Mathematically
(https://en.wikipedia.org/wiki/Euler%E2%80%93Lagrange_equation#Several_functions_of_several_variables_with_single_derivative):

  ∂E/∂u - ∂/∂x (∂E/∂u_x) - ∂/∂y (∂E/∂u_y) = 0

The first term disappears, since we don't have a non-differentiated u(x, y)
in E_S. (Previously, the two _other_ terms would disappear, because we
didn't have u_x or u_y in E_I or E_G.) This means we get

  - ∂/∂x (u_x / sqrt(u_x² + u_y² + v_x² + v_y² + ε²))
  - ∂/∂y (u_y / sqrt(u_x² + u_y² + v_x² + v_y² + ε²)) = 0

(We don't remove the minus signs, since this is supposed to be added to all
the other terms.)

This is what's called an _anisotropic diffusion_ (or Perona–Malik
diffusion) equation, and is extensively described in the literature. It has
the effect of smoothing the flow more in some places than others; in
particular, it does not smooth as strongly near edges, so it is
edge-preserving. (It's a bit odd to call it anisotropic, since it does
smooth equally in all directions; [Brox05] calls it vector-valued
diffusion.)

We'd love to use our usual trick of keeping the nonlinear terms in the
denominator constant, but alas, we can't do that yet, since they are under
the differentiation operator; this factor has to be discretized together
with u before we can treat it as a constant. So instead, we'll define it as
a function (called the _diffusivity_ at the given point):

  g(x, y) = 1 / sqrt(u_x² + u_y² + v_x² + v_y² + ε²)

which gives us

  - ∂/∂x ( g(x, y) u_x ) - ∂/∂y ( g(x, y) u_y ) = 0

We'll also have a similar equation for minimizing v, of course:

  - ∂/∂x ( g(x, y) v_x ) - ∂/∂y ( g(x, y) v_y ) = 0

There's no normalization term β here, unlike in the other terms; DeepFlow2
adds one, but we're not including it here.

At this point, we make a tweak. This seemingly goes back to at least
[Brox04], which also makes the same tweak to all the other terms (which we
don't, but see below). We split u (and v) into something based on the
original value plus a differential du (and dv), and then solve for du (or
dv) instead. (In math-speak, we are moving to an implicit method, which is
often more numerically stable.) In other words,

  u(x, y) = u0(x, y) + du(x, y)

where u0(x, y) is the initial guess for the flow. (It's not the value from
the previous iteration but the very first one, for reasons that will become
clear later. [Brox04] differs here, but it does a number of things
differently in the numerics anyway.) This gives us:

  - ∂/∂x ( g(x, y) (u0 + du)_x ) - ∂/∂y ( g(x, y) (u0 + du)_y ) = 0

or

  - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y ) =
    ∂/∂x ( g(x, y) u0_x ) + ∂/∂y ( g(x, y) u0_y )

where the right-hand side is effectively a constant for these purposes
(although it still needs to be calculated anew for each iteration, since
g(x, y) changes).

Of course, now we have a different problem; all the other terms are
formulated in terms of u and v, not du and dv. DeepFlow solves this by not
searching for the flow between I_0 and I_1, but between I_0 and a
pre-warped I_1. In other words, before any of the derivatives involving I_t
are calculated, we calculate an I_w with bilinear interpolation:

  I_w(x, y) = I_1(x + u0(x, y), y + v0(x, y))

and then redefine I_t (occasionally called I_z) as

  I_t(x, y) = I_w(x, y) - I_0(x, y)

Note that the plus sign effectively means inverting the flow, so if u0 and
v0 were already correctly estimated, and perfectly smooth and linear
everywhere, I_w = I_0. (All spatial derivatives are calculated on the mean
of I_0 and I_w; the paper doesn't mention this.) After this, all the
equations for E_I and E_G earlier will still hold, they will just be
calculating du and dv instead.
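As an illustration of the pre-warping step, here is a small numpy sketch of
the bilinear lookup. It is only meant to show the sampling itself; the
names are my own, and the clamping at the image border is a simplification:

  import numpy as np

  def prewarp(I1, u0, v0):
      # Bilinearly sample I_w(x, y) = I_1(x + u0(x, y), y + v0(x, y)).
      h, w = I1.shape
      ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
      sx = np.clip(xs + u0, 0, w - 1.001)   # clamp to the image edge
      sy = np.clip(ys + v0, 0, h - 1.001)
      x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
      fx, fy = sx - x0, sy - y0
      Iw = (I1[y0,     x0    ] * (1 - fx) * (1 - fy) +
            I1[y0,     x0 + 1] * fx       * (1 - fy) +
            I1[y0 + 1, x0    ] * (1 - fx) * fy +
            I1[y0 + 1, x0 + 1] * fx       * fy)
      return Iw

  # The temporal derivative is then redefined as I_t = I_w - I_0.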
Note that this means we have three values for the flow: there's u0 for the
initial guess, du for the current guess of the delta from u0 (which makes
u0 + du the current guess of the flow), and du' for the previous guess of
the delta from u0. (The initial values for du' and dv' will be zero.)

Now back to our equations, as we look at practical implementation:

  - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y ) =
    ∂/∂x ( g(x, y) u0_x ) + ∂/∂y ( g(x, y) u0_y )
  - ∂/∂x ( g(x, y) dv_x ) - ∂/∂y ( g(x, y) dv_y ) =
    ∂/∂x ( g(x, y) v0_x ) + ∂/∂y ( g(x, y) v0_y )

We can discretize the left-hand and right-hand sides identically (they
differ only in signs and in variable), so let's look only at

  - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y )

[Brox05] equation (2.14) (which refers to a 1998 book, although I couldn't
immediately find the equation in question in that book) discretizes this as

  - 1/2 (g(x+1, y) + g(x, y)) (du(x+1, y) - du(x, y))
  + 1/2 (g(x-1, y) + g(x, y)) (du(x, y) - du(x-1, y))
  - 1/2 (g(x, y+1) + g(x, y)) (du(x, y+1) - du(x, y))
  + 1/2 (g(x, y-1) + g(x, y)) (du(x, y) - du(x, y-1))

It also mentions that it would be better to sample g at the half-way
points, e.g. g(x+0.5, y), but that raises the question of exactly how we'd
do that, and DeepFlow doesn't seem to care, so we stick with their version.

Now we can finally let g use the values of the flow (note that this is the
actual flow u and v, not du and dv!) from the previous iteration, as
before:

  g(x, y) = 1 / sqrt(u'_x² + u'_y² + v'_x² + v'_y² + ε²)

The single derivatives in g(x, y) are approximated by standard central
differences (see https://en.wikipedia.org/wiki/Finite_difference_coefficient),
e.g.

  u_x(x, y) = 1/2 (u(x + 1, y) - u(x - 1, y))

although the derivatives of I use the fancier

  I_x(x, y) = 1/12 (-I(x + 2, y) + 8 I(x + 1, y) - 8 I(x - 1, y) + I(x - 2, y))

I assume this is because the I derivatives are calculated only once, so we
can afford more accurate derivatives (or possibly simply because of
influence from earlier papers).

Let's now define a smoothness constant between the neighbors (x, y) and
(x1, y1):

  s(x1, y1) = 1/2 (g(x, y) + g(x1, y1))

Collecting all the du(x, y) terms of the discretized equation above,
ignoring the right-hand side, which is just a constant for us anyway:

  - s(x+1, y) (du(x+1, y) - du(x, y)) + s(x-1, y) (du(x, y) - du(x-1, y))
  - s(x, y+1) (du(x, y+1) - du(x, y)) + s(x, y-1) (du(x, y) - du(x, y-1)) = C

  - s(x+1, y) du(x+1, y) + s(x+1, y) du(x, y)
  + s(x-1, y) du(x, y) - s(x-1, y) du(x-1, y)
  - s(x, y+1) du(x, y+1) + s(x, y+1) du(x, y)
  + s(x, y-1) du(x, y) - s(x, y-1) du(x, y-1) = C

  (s(x+1, y) + s(x-1, y) + s(x, y+1) + s(x, y-1)) du(x, y) =
    s(x+1, y) du(x+1, y) + s(x-1, y) du(x-1, y) +
    s(x, y+1) du(x, y+1) + s(x, y-1) du(x, y-1) + C

It is interesting to note that if s = 1 uniformly, which would be the case
without our penalizer Ψ(a²), we would have the familiar discrete Laplacian,
where du(x, y) would seek to simply become the average of its four
immediate neighbors.

Now our equation system is finally complete and linear, and the rest is
fairly pedestrian. The smoothness term connects all the unknowns together,
but we still solve them mostly as 2x2 matrices. The most basic iterative
method is Jacobi, where we solve du(x, y) and dv(x, y) using the previous
iteration's values for all other du/dv values. (That this converges at all
is beyond this text to prove, but it does. Not that we bother iterating
until it converges; a few iterations are good enough.)
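Here is a rough numpy sketch of how one such update could look: the
diffusivity g, the neighbor weights s, and a single Jacobi-style solve of
the local 2x2 system at an interior pixel. A11/A12/A22/b1/b2 stand for a
per-pixel data-term system like the one sketched earlier (with the constant
right-hand side coming from u0's derivatives assumed to be already folded
into b1/b2), alpha is the smoothness weight from the energy, and all names
are illustrative rather than taken from the actual code:

  import numpy as np

  def diffusivity(u, v, eps=0.001):
      # g(x, y) = 1 / sqrt(u_x^2 + u_y^2 + v_x^2 + v_y^2 + eps^2),
      # with central differences for the flow derivatives.
      u_y, u_x = np.gradient(u)
      v_y, v_x = np.gradient(v)
      return 1.0 / np.sqrt(u_x**2 + u_y**2 + v_x**2 + v_y**2 + eps**2)

  def jacobi_update(du, dv, g, A11, A12, A22, b1, b2, alpha, x, y):
      # One Jacobi update of (du, dv) at the interior pixel (x, y);
      # s(x1, y1) = (g(x, y) + g(x1, y1)) / 2 for the four neighbors.
      s_e = 0.5 * (g[y, x] + g[y, x + 1])
      s_w = 0.5 * (g[y, x] + g[y, x - 1])
      s_s = 0.5 * (g[y, x] + g[y + 1, x])
      s_n = 0.5 * (g[y, x] + g[y - 1, x])
      s_sum = s_e + s_w + s_s + s_n
      # Neighbor values use the previous iteration's du/dv (plain Jacobi).
      smooth_u = (s_e * du[y, x + 1] + s_w * du[y, x - 1] +
                  s_s * du[y + 1, x] + s_n * du[y - 1, x])
      smooth_v = (s_e * dv[y, x + 1] + s_w * dv[y, x - 1] +
                  s_s * dv[y + 1, x] + s_n * dv[y - 1, x])
      # Local 2x2 system: data terms plus alpha times the smoothness terms.
      a11 = A11[y, x] + alpha * s_sum
      a22 = A22[y, x] + alpha * s_sum
      a12 = A12[y, x]
      r1 = b1[y, x] + alpha * smooth_u
      r2 = b2[y, x] + alpha * smooth_v
      det = a11 * a22 - a12 * a12
      return (a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det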
Gauss-Seidel iterations improve on this by (surprisingly!) using this
iteration's already-computed du/dv values where they are available; this
improves convergence, but is hard to parallelize. Successive
over-relaxation (SOR) improves further on this, in that it assumes that the
solution moves towards the right value, so why not just go a bit further?
That is, if Gauss-Seidel would tell you to increase the flow by 1.0 pixel
to the right, perhaps go 1.5 pixels to the right instead (this factor is
called ω). Again, the convergence proof is beyond the scope here, but SOR
converges for any ω between 1 and 2 (1 gives plain Gauss-Seidel, and above
2, we risk overshooting and never converging). The optimal ω depends on the
equation system; DIS uses ω = 1.6, which presumably was measured, while we
use ω = 1.8 (which seems to be marginally better after some light testing).

Efficient GPU implementation of SOR is not trivial; as noted before,
Gauss-Seidel is inherently serial, which is a poor match for the GPU.
Worse, doing SOR with Jacobi as the base instead of Gauss-Seidel makes for
an algorithm which simply does not converge. We solve this by using a
method called red-black SOR (not to be confused with red-black binary
trees). Conceptually, it assigns every unknown a color, alternating between
red and black like a checkerboard. Since red values now only depend on
black values and vice versa, one can do all red values in parallel, then
all black values, and so on. (This is equivalent to reordering the equation
set; different such orderings can have different convergence speeds.)

Our GPU SOR implementation is not overly efficient, so essentially one such
half-iteration of red-black SOR costs the same as one full iteration of
Jacobi, but convergence is so much faster that it's worth it. Generally
speaking, Gauss-Seidel converges twice as fast as Jacobi (ie., if Jacobi
converges in N iterations, Gauss-Seidel does so in N/2), but SOR converges
_geometrically_ faster, ie., in O(√N) iterations.

Do note that the DeepFlow code does not fully use SOR or even Gauss-Seidel;
it solves every 2x2 block (ie., a single du/dv pair) using Cramer's rule,
and then pushes that vector 60% further, SOR-style. This would clearly be
more accurate if we didn't have SOR in the mix (since du and dv would
converge immediately relative to each other, bar Cramer's numerical
issues), but I'm not sure whether it's better given SOR. (DIS changes this
to a more traditional SOR formulation, which we also use. It doesn't seem
to be much different in practical testing; perhaps minutely worse, but I
haven't done a deep analysis here.)
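To tie the last pieces together, here is a rough sketch of one red-black
SOR iteration in the same illustrative Python as above, reusing the
hypothetical jacobi_update() from the previous sketch and the ω = 1.8
relaxation mentioned earlier. This is my own sketch of the idea, not the
actual GPU shader:

  def red_black_sor_iteration(du, dv, g, A11, A12, A22, b1, b2,
                              alpha, omega=1.8):
      # One full red-black SOR iteration: first all "red" pixels
      # ((x + y) even), then all "black" ones ((x + y) odd). Within one
      # color, every pixel only reads neighbors of the other color, so each
      # half-sweep could run fully in parallel.
      h, w = du.shape
      for parity in (0, 1):
          for y in range(1, h - 1):
              for x in range(1 + (y + parity + 1) % 2, w - 1, 2):
                  # Solve the local 2x2 system as in jacobi_update() above...
                  du_new, dv_new = jacobi_update(du, dv, g, A11, A12, A22,
                                                 b1, b2, alpha, x, y)
                  # ...then overrelax: go omega times as far as the solve says.
                  du[y, x] += omega * (du_new - du[y, x])
                  dv[y, x] += omega * (dv_new - dv[y, x])
      return du, dv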
And that's it.

References:

[Brox04]: Brox, Bruhn, Papenberg, Weickert: “High Accuracy Optical Flow
  Estimation Based on a Theory for Warping”, in Proceedings of the European
  Conference on Computer Vision (ECCV), 2004

[Brox05]: Brox: “From Pixels to Regions: Partial Differential Equations in
  Image Analysis”, PhD thesis, 2005

[Brox11]: Brox, Malik: “Large Displacement Optical Flow: Descriptor
  Matching in Variational Motion Estimation”, IEEE Transactions on Pattern
  Analysis and Machine Intelligence, 2011

[Ciarlet78]: Ciarlet: “The Finite Element Method for Elliptic Problems”,
  1978

[Fahad07]: Fahad, Morris: “Multiple Combined Constraints for Optical Flow
  Estimation”, in Proceedings of the 3rd International Conference on
  Advances in Visual Computing (ISVC), 2007

[Kroeger16]: Kroeger, Timofte, Dai, van Gool: “Fast Optical Flow using
  Dense Inverse Search”, in Proceedings of the European Conference on
  Computer Vision (ECCV), 2016

[Weinzaepfel13]: Weinzaepfel, Revaud, Harchaoui, Schmid: “DeepFlow: Large
  displacement optical flow with deep matching”, in IEEE International
  Conference on Computer Vision (ICCV), 2013

[Zimmer11]: Zimmer, Bruhn, Weickert: “Optic Flow in Harmony”, International
  Journal of Computer Vision, 2011