git.sesse.net Git - nageru/blob - futatabi/variational_refinement.txt

   1 Variational refinement -- an introduction and derivation
   2
   3 The variational refinement is probably the most difficult part of the
   4 algorithm to understand, in part because the description in most papers
   5 are very heavy on notation and rather light on exposition. I've tried
   6 to give a somewhat friendlier introduction to this specific algorithm
   7 below.
   8
   9 The general idea is fairly simple; we try to optimize the flow field
  10 as a whole, by minimizing some mathematical notion of badness expressed
  11 as an energy function. The one used in the dense inverse search paper
  12 [Kroeger16; se references below] has this form:
  13
  14   E(U) = int( σ Ψ(E_I) + γ Ψ(E_G) + α Ψ(E_S) ) dx
  15
  16 where Ψ(a²) = sqrt(a² + ε²) for some small constant ε = 0.001, and
  17 σ, γ, α are empirically set weighting constants. (We'll get to what the
  18 different enery terms are in a minute.) The integral is, for all practical
  19 purposes, just a sum over all the pixels in the flow.
  20
  21 In general, such formulas are nonconvex and highly nonlinear, so we
  22 cannot hope to find a global minimum -- but if we start from the flow
  23 generated by the motion search, we can at least hope to make it somehow
  24 better by walking towards a local minimum. (In fact, there are many
  25 methods for optical flow that work _only_ by such minimization,
  26 so the word “refinement” is maybe not doing the method justice.
  27 One could just as well say that the motion search is a way of
  28 finding a reasonable starting point for the optimization.)
  29
  30 The dense inverse search paper [Kroeger16] sets
  31 up the energy terms as described by some motion tensors and normalizations,
  32 then says simply that it is optimized by “θ_vo fixed point iterations
  33 and θ_vi iterations of Successive Over Relaxation (SOR) for the linear
  34 system”. It's not immediately obvious what this means, but it gives
  35 a reference to [Brox04]. However, that paper describes a numerical
  36 approximation scheme that is _far_ more complicated than what the DIS
  37 code actually does.
  38
  39 Rather, one must look at the other main reference they are giving,
  40 which is [Weinzaepfel13], describing a system called DeepFlow.
  41 DIS borrows most of the exposition and code for its variational
  42 refinement from DeepFlow, just removing some terms and fixing up
  43 a few issues here and there. (There are some slight differences in
  44 the paper, like the use of ∂z instead of ∂t, but that looks mostly
  45 like an error to me.) Unfortunately, that paper in turn refers to
  46 [Brox11], which appears no more useful in clearing up the notation
  47 to me.
  48
  49 However, digging down in the references, finally one finds [Zimmer11],
  50 which is where the tensor notation appears to come from. This allows
  51 us to look at the first term in the energy, E_I, which comes from the
  52 intensity constant assumption. The basic idea is optical flow nearly
  53 by definition should preserve intensity after the warp:
  54
  55   I_0(x + u) = I_1(x)
  56
  57 where I_0 is the first picture, I_1 is the second, x is any 2D
  58 coordinate and u is the flow at x (which we are optimizing over).
  59 In general, we'll be optimizing over the entire field of u
  60 (potentially hundreds of thousands of values), but we'll be looking
  61 mostly at individual points, so we'll skip the coordinates when we
  62 can (e.g. we write u instead of or u(x, y)). u is of course the 2D
  63 flow, although often, we'll write its components separately as u and v
  64 instead of as a vector u.
  65
  66 Before we go further, we need to add some more notation:
  67
  68   * I_x is the partial derivative of I with respect to x (at some
  69     point), and similarly for I_y. These do not depend on u,
  70     so they can be precalculated before the optimization.
  71   * I_xx is the double partial derivative of I, and similar for
  72     I_yy and I_xy (the latter is the same as I_yx).
  73   * I_t is the temporal derivative of I, ie. in practice just
  74     I_t(x) = I_1(x) - I_0(x).
  75
  76 Returning now to our original assertion:
  77
  78   I_0(x + u) = I_1(x)
  79
  80 Classically in optical flow, one assumes that the flow is smooth
  81 and linear around the point x, which allows one to approximate this
  82 equation by
  83
  84   I_x u + I_y v + I_t = 0
  85
  86 This is usually simply called “the optical flow constraint”,
  87 and gives rise to a very natural part of the energy:
  88
  89   E_I = I_x u + I_y v + I_t
  90
  91 Remember that we send E_I through the function Ψ(a²) = sqrt(a² + ε²),
  92 so clearly Ψ(E_I) will be minimized if indeed E_I is zero.
  93
  94 At this point, many papers start talking about Euler-Lagrange
  95 multivariate equations, which is a fairly daunting concept
  96 (at least the Wikipedia page is suitable for scaring small children).
  97 However, for the first two terms, we don't need its general form,
  98 and it reduces to something much simpler; just differentiate the energy
  99 by u and equate the result to zero (finding some minimum; it can't be
 100 a maximum, since *wave hands intensely*). Then differentiate the energy
 101 by v and set that to zero, too; now you have two equations in two
 102 unknowns (or, since we're optimizing over a field, maybe 500k
 103 equations in 500k unknowns -- although the equation set will be
 104 very sparse), which is hopefully solvable using linear methods.
 105 We'll look at what this gives for E_I in a moment, then try to apply
 106 the same notions to E_G and E_S later.
 107
 108 First we modify E_I a bit by adding some normalization:
 109
 110   E_I = β_0 (I_x u + I_y v + I_t)
 111
 112 where β_0 = 1/(abs(∇I)² + 0.01). Note that β_0 depends on I only,
 113 so for the purposes of optimizing u, it's a constant and can be
 114 precomputed across I. (β_0 will, of course, depend on x, but so
 115 do all the other terms in the equation.)
 116
 117 Now we give it to Maple, differentiating first by u and then by v:
 118
 119 > M := (u,v) -> B_0 * (I_x * u + I_y * v + I_t);
 120                    M := (u, v) -> B_0 (I_x u + I_y v + I_t)
 121
 122 > diff(sqrt(M(u,v)^2 + e), u);
 123                            2
 124                         B_0  (I_x u + I_y v + I_t) I_x
 125                      ------------------------------------
 126                          2                      2     1/2
 127                      (B_0  (I_x u + I_y v + I_t)  + e)
 128
 129 > diff(sqrt(M(u,v)^2 + e), v);
 130                            2
 131                         B_0  (I_x u + I_y v + I_t) I_y
 132                      ------------------------------------
 133                          2                      2     1/2
 134                      (B_0  (I_x u + I_y v + I_t)  + e)
 135
 136
 137 So these are the two expressions to be set to zero (for each
 138 point). We'll notice immediately that this isn't very linear
 139 in u and v, so here's where the “fixed point iterations” come in;
 140 we simply assume that our previous values for u and v are
 141 approximately good enough for the denominator, and optimize
 142 them in the numerator only. Then we get new values that are
 143 hopefully a bit closer, which we can then use for the
 144 denominator, and so on. (This is seemingly an old technique;
 145 [Brox05] cites [Ciarlet78]. It is justifiable in the sense
 146 that the only thing really held constant is the derivative
 147 of the penalizer.) In other words, if we define the constant
 148
 149   k1 = β_0² / sqrt(β_0² (I_x u' + I_y v' + I_t)² + ε²)
 150
 151 (where u' and v' are the guesses for u and v from the previous
 152 iteration)
 153
 154 we have the much more manageable
 155
 156   k1 I_x²    u + k1 I_x I_y v = - k1 I_t I_x
 157   k1 I_x I_y u + k1 I_y²    v = - k1 I_t I_y
 158
 159 ie., two linear equations in u and v. Now, you will notice two
 160 immediate problems by this equation set:
 161
 162   * The factor k1 is completely useless, since it's just multiplied
 163     in everywhere.
 164   * The set of equations is colinear (the determinant of the matrix
 165     is zero), and thus there is an infinite number of possible
 166     solutions—this is known as the so-called “aperture problem”.
 167     It shouldn't be surprising, though, as we cannot expect that
 168     starting with a single constraint should allow us to solve
 169     for two unknowns.
 170
 171 However, both problems will go away as soon as we start adding
 172 more terms, so let's look at the gradient constancy term E_G next.
 173 It is fairly similar to the brightness constancy term, except it
 174 uses the (spatial) gradient instead of intensity:
 175
 176   ∇I_0(x + u) = ∇I_1(x)
 177
 178 or equivalently (by definition):
 179
 180   (∂I/∂x)_0(x + u) = (∂I/∂x)_1(x)
 181   (∂I/∂y)_0(x + u) = (∂I/∂y)_1(x)
 182
 183 The idea is that this is more robust to changes in lighting.
 184 It doesn't replace the intensity term, but augments it; the weighting
 185 constants σ and γ control their relative importance. Also note that
 186 this actually gives us two independent equations, unlike the brightness
 187 constancy term.
 188
 189 However, it is not obvious at all how to discretize this. In particular,
 190 most papers, including [Brox04], appear to want _not_ to make any linear
 191 assumptions of the flow in this case, and end up with tons of terms.
 192 (The DIS and DeepFlow papers do, again, use some tensor notation that
 193 I do not understand, but I'm not convinced it actually contains any
 194 of the discretization.)
 195
 196 Yet more paper searching eventually turns up [Fahad07], which simply
 197 states that the discretized versions of these equations are:
 198
 199   I_xx u + I_xy v + I_xt = 0
 200   I_yx u + I_yy v + I_yt = 0.
 201
 202 which seems to match well what the DIS code uses. Note that even though
 203 this is an equation set equal to zero, we can't just solve for them;
 204 we need to make (penalized, normalized) energy terms and add them to
 205 the other terms. This gives
 206
 207   E_G = β_x (I_xx u + I_xy v + I_xt) + β_y (I_yx u + I_yy v + I_yt)
 208
 209 with normalization terms
 210
 211   β_x = 1 / (abs(∇(I_x))² + 0.01)  (∇(I_x) is the gradient of ∂I/∂x)
 212   β_y = 1 / (abs(∇(I_y))² + 0.01)
 213
 214 (The DIS paper writes ∇I_dx and ∇I_dy instead of ∇I_x and ∇I_y, but I believe
 215 that's a typo; the DeepFlow paper says ∇I_x and ∇I_y.)
 216
 217 The papers both write that Ψ(E_G) is used, which would mean that the penalized
 218 term is
 219
 220   E_G = sqrt((β_x (I_xx u + I_xy v + I_xt) + β_y (I_yx u + I_yy v + I_yt))² + ε²)
 221
 222 but that isn't what the code actually does. Instead, it seems that the two
 223 terms are squared independently:
 224
 225   E_G = sqrt((β_x (I_xx u + I_xy v + I_xt))² + (β_y (I_yx u + I_yy v + I_yt))² + ε²)
 226
 227 Both are solvable just fine, and it probably does not matter all that much
 228 which we use in practice (although [Zimmer11] suggests that if we are using
 229 multichannel images, we should penalize the three channels separately),
 230 but we follow what the code actually does here.
 231
 232 We can differentiate them and equate them to zero as before:
 233
 234 > M_x := (u,v) -> B_x * (I_xx * u + I_xy * v + I_xt);
 235                       M_x := (u, v) -> B_x (I_xx u + I_xy v + I_xt)
 236
 237 > M_y := (u,v) -> B_y * (I_xy * u + I_yy * v + I_yt);
 238                       M_y := (u, v) -> B_y (I_xy u + I_yy v + I_yt)
 239
 240 > diff(sqrt(M_x(u,v)^2 + M_y(u,v)^2 + e), u);
 241                                      2             2
 242        2 (I_xx u + I_xy v + I_xt) B_x  I_xx + 2 B_y  (I_xy u + I_yy v + I_yt) I_xy
 243        ---------------------------------------------------------------------------
 244                                   2    2      2                         2     1/2
 245        2 ((I_xx u + I_xy v + I_xt)  B_x  + B_y  (I_xy u + I_yy v + I_yt)  + e)
 246
 247 > diff(sqrt(M_x(u,v)^2 + M_y(u,v)^2 + e), v);
 248                                      2             2
 249        2 (I_xx u + I_xy v + I_xt) B_x  I_xy + 2 B_y  (I_xy u + I_yy v + I_yt) I_yy
 250        ---------------------------------------------------------------------------
 251                                   2    2      2                         2     1/2
 252        2 ((I_xx u + I_xy v + I_xt)  B_x  + B_y  (I_xy u + I_yy v + I_yt)  + e)
 253
 254 Using the same fixed-point scheme where we hold the terms in the
 255 denominator constant and equal to last iteration's values, we get
 256 a new common constant
 257
 258   k2 = 1 / sqrt(β_x² (I_xx u' + I_xy v' + I_xt)² + β_y² (I_xy u' + I_yy v' + I_yt)²)
 259
 260 and for brevity
 261
 262   k_x = k2 β_x²
 263   k_y = k2 β_y²
 264
 265 and thus, collecting terms for u and v, we get the two equations:
 266
 267   (k_x I_xx² + k_y I_xy²)         u + (k_x I_xx I_xy + k_y I_xy I_yy) v = - k_x I_xx I_xt - k_y I_xy I_yt
 268   (k_x I_xx I_xy + k_y I_xy I_yy) u + (k_x I_xy² + k_y I_yy²)         v = - k_x I_xy I_xt - k_y I_yy I_yt
 269
 270 which is linear in u and v, not colinear (unless we are extremely
 271 unlucky), and can be easily solved.
 272
 273 Of course, for optimizing the weighted sum σ Ψ(E_I) + γ Ψ(E_G),
 274 we just add the two equation sets pairwise with appropriate weights.
 275
 276 There's a small discrepancy here: The equations suggest that we should
 277 be be squaring the normalization terms β_0², β_x², β_y²; however, the
 278 code does not appear to do so. It's possible that they were intended to be
 279 added outside of the penalization, e.g. Ψ(a²) = sqrt(β a² + ε²), but given
 280 that these come from [Zimmer11], which mentions nothing of the sort,
 281 I'll just have to assume that this is an implementation mishap.
 282
 283 The final smoothness term the one that binds the flow field together as a whole
 284 so that we don't have WxH completely independent equations (with its positive
 285 and negative sides, of course). It is the simplest in terms of notation,
 286 but it requires the full power of the Euler-Lagrange equations to minimize,
 287 so we'll need to figure that part out.
 288
 289   E_S = abs(∇u)² + abs(∇v)²
 290
 291 or
 292
 293   E_S = (u_x² + u_y²) + (v_x² + v_y²)
 294
 295 The penalized form used in the DeepFlow notation, contrary to what you'd expect
 296 from the paper, is:
 297
 298   E_S = sqrt(u_x² + u_y² + v_x² + v_y² + ε²)
 299
 300 How would one go about to minimize such an expression by u? (We'll deal with v
 301 later.) It's perhaps no big surprise that the expression involves double
 302 derivatives, but the full form involves the Euler-Lagrange equations.
 303 They allow us to minimize expressions that contain x, y, u(x, y) _and_ the partial
 304 derivatives u_x(x, y) and u_y(x, y), although the answer becomes a differential
 305 equation.
 306
 307 The Wikipedia page is, unfortunately, not very beginner-friendly,
 308 but the general idea is: Differentiate the expression by u_x
 309 (yes, differentiating by a partial derivative!), negate it, and then
 310 differentiate the result by x. Then do the same thing by u_y and y,
 311 add the two results together and equate to zero. Mathematically
 312 (https://en.wikipedia.org/wiki/Euler%E2%80%93Lagrange_equation#Several_functions_of_several_variables_with_single_derivative):
 313
 314   ∂E/∂u - ∂/∂x (∂E/∂u_x) - ∂/∂y (∂E/∂u_y) = 0
 315
 316 The first term disappears, since we don't have a non-differentiated
 317 u(x, y) in E_S. (Previously, the two _other_ terms would disappear,
 318 because we didn't have u_x or u_y in E_I or E_G.) This means we get
 319
 320   - ∂/∂x (u_x / sqrt(u_x² + u_y² + v_x² + v_y² + ε²)) - ∂/∂y (u_y / sqrt(u_x² + u_y² + v_x² + v_y² + ε²)) = 0
 321
 322 (We don't remove the minus signs since this is supposed to be added to
 323 all the other terms.)
 324
 325 This is what's called an _anisotropic diffusion_ (or Perona–Malik diffusion)
 326 equation, and is extensively described in literature. It has the effect of
 327 smoothing the flow more in some places than others; in particular, it does
 328 not smooth as strongly near edges, so it is edge-preserving. (It's a bit odd to
 329 call it anisotropic, since it does smooth equally in all directions;
 330 [Brox05] calls it vector-valued diffusion.)
 331
 332 We'd love to our usual trick of keeping the nonlinear terms in the denominator
 333 constant, but alas, we can't do that yet, since it's under the differentiation
 334 operator; this factor has to be discretized together with u before we can treat
 335 it as a constant. So instead, we'll define it as a function (called the
 336 _diffusivity_ at the given point):
 337
 338   g(x, y) = 1 / sqrt(u_x² + u_y² + v_x² + v_y² + ε²) = 0
 339
 340 which gives us
 341
 342   - ∂/∂x ( g(x, y) u_x ) - ∂/∂y ( g(x, y) u_y ) = 0
 343
 344 We'll also have a similar equation for minimizing v, of course:
 345
 346   - ∂/∂x ( g(x, y) v_x ) - ∂/∂y ( g(x, y) v_y ) = 0
 347
 348 There's no normalization term β here, unlike the other terms; DeepFlow2
 349 adds one, but we're not including it here.
 350
 351 At this point, we make a tweak. This seemingly goes back to at least
 352 [Brox04], which also makes the same tweak to all the other terms
 353 (which we don't, but see below). We split u (and v) into something
 354 based on the original value plus a differential du (and dv), and then
 355 solve for du (or dv) instead. (In math-speak, we are moving to an
 356 implicit method, which is often more numerically stable.) In other words,
 357
 358   u(x, y) = u0(x, y) + du(x, y)
 359
 360 where u0(x, y) is the initial guess for the flow. (It's not the value
 361 from previous iteration, for reasons that will be clear later, it's
 362 the first one. [Brox04] differs here, but it does a number of things
 363 differently in the numerics anyway.)
 364
 365 This gives us:
 366
 367   - ∂/∂x ( g(x, y) (u0 + du)_x ) - ∂/∂y ( g(x, y) (u0 + du)_y ) = 0
 368
 369 or
 370
 371   - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y ) = ∂/∂x ( g(x, y) u0_x ) + ∂/∂y ( g(x, y) u0_y )
 372
 373 where the right-hand side is effectively a constant for these purposes
 374 (although it still needs to be calculated anew for each iteration,
 375 since g(x, y) changes).
 376
 377 Of course, now we have a different problem; all the other terms are
 378 formulated in terms of u and v, not du and dv. DeepFlow solves this
 379 by not searching for the flow between I_0 and I_1, but between I_0 and
 380 a pre-warped I_1. In other words, before any of the derivatives involving
 381 I_t are calculated, we calculate an I_w with bilinear interpolation:
 382
 383   I_w(x, y) = I_1(x + u0(x, y), y + v0(x, y))
 384
 385 and then redefine I_t (occasionally called I_z) as
 386
 387   I_t(x, y) = I_w(x, y) - I_0(x, y)
 388
 389 Note that the plus sign effectively means inverting the flow, so if
 390 the u0 and v0 were already correctly estimated, perfectly smooth and linear
 391 everywhere, I_w = I_0. (All spatial derivatives are calculated on the mean
 392 between I_0 and I_w; the paper doesn't mention this.) After this, all the
 393 equations for E_I and E_G earlier will still hold, they will just be
 394 calculating du and dv instead. Note that this means we have three values
 395 for the flow; there's u0 for the initial guess, du for the current guess
 396 of delta from u0 (which makes u0 + du the current guess of the flow),
 397 and du' for the previous guess of delta from u0. (The initial values for
 398 du' and dv' will be zero.)
 399
 400 Now back to our equations, as we look at practical implementation:
 401
 402   - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y ) = ∂/∂x ( g(x, y) u0_x ) + ∂/∂y ( g(x, y) u0_y )
 403   - ∂/∂x ( g(x, y) dv_x ) - ∂/∂y ( g(x, y) dv_y ) = ∂/∂x ( g(x, y) v0_x ) + ∂/∂y ( g(x, y) v0_y )
 404
 405 We can discretize the left-hand and right-hand side identically (they differ
 406 only in signs and in variable), so let's look only at
 407
 408   - ∂/∂x ( g(x, y) du_x ) - ∂/∂y ( g(x, y) du_y )
 409
 410 [Brox05] equation (2.14) (which refers to a 1998 book, although I couldn't
 411 immediately find the equation in question in that book) discretizes this as
 412
 413   - 1/2 (g(x+1, y) + g(x, y)) (du(x+1, y) - du(x, y))
 414   + 1/2 (g(x-1, y) + g(x, y)) (du(x, y) - du(x-1, y))
 415   - 1/2 (g(x, y+1) + g(x, y)) (du(x, y+1) - du(x, y))
 416   + 1/2 (g(x, y-1) + g(x, y)) (du(x, y) - du(x, y-1))
 417
 418 It also mentions that it would be better to sample g at the half-way points,
 419 e.g. g(x+0.5, y), but that begs the question exactly how we'd do that, and
 420 DeepFlow doesn't seem to care, so we stick with their version.
 421
 422 Now we can finally let g use the values of the flow (note that this is the
 423 actual flow u and v, not du and dv!) from the previous iteration, as before:
 424
 425   g(x, y) = 1 / sqrt(u'_x² + u'_y² + v'_x² + v'_y² + ε²)
 426
 427 The single derivatives in g(x) are approximated by standard central differences
 428 (see https://en.wikipedia.org/wiki/Finite_difference_coefficient), e.g.
 429
 430   u_x(x, y) = 1/2 (u(x + 1, y) - u(x - 1, y))
 431
 432 although the derivatives of I are using the fancier
 433
 434   I_x(x, y) = 1/12 (-I(x - 2, y) + 8 I(x - 1, y) - 8 I(x - 1, y) + I(x - 2, y))
 435
 436 I assume this is because I_x derivatives are calculated only once, so we can
 437 afford more accurate derivatives (or possibly simply because of influence
 438 from earlier papers).
 439
 440 Let's now define a smoothness constant between the neighbors (x,y) and (x1,y1):
 441
 442   s(x1, y1) = 1/2 (g(x, y) + g(x1, y1))
 443
 444 Collecting all the du(x, y) terms of the discretized equation above,
 445 ignoring the right-hand side, which is just a constant for us anyway:
 446
 447   - s(x+1, y) (du(x+1, y) - du(x, y))
 448   + s(x-1, y) (du(x, y) - du(x-1, y))
 449   - s(x, y+1) (du(x, y+1) - du(x, y))
 450   + s(x, y-1) (du(x, y) - du(x, y-1)) = C
 451
 452   - s(x+1, y) du(x+1, y) + s(x+1, y) du(x, y)
 453   + s(x-1, y) du(x, y) - s(x-1, y) du(x-1, y)
 454   - s(x, y+1) du(x, y+1) + s(x, y+1) du(x, y)
 455   + s(x, y-1) du(x, y) - s(x, y-1) du(x, y-1) = C
 456
 457   (s(x+1, y) + s(x-1, y) + s(x, y+1) + s(x, y-1)) du(x, y) =
 458   s(x+1, y) du(x+1, y) + s(x-1, y) du(x-1, y) + s(x, y+1) du(x, y+1) + s(x, y-1) du(x, y-1) + C
 459
 460 It is interesting to note that if s = 1 uniformly, which would be the case
 461 without our penalizer Ψ(a²), we would have the familiar discrete Laplacian,
 462 where du(x, y) would seek to simply become the average of its four immediate
 463 neighbors.
 464
 465 Now our equation system is finally complete and linear, and the rest is
 466 fairly pedestrian. The last term connects all the unknowns together,
 467 but we still solve them mostly as 2x2 matrices. The most basic iterative
 468 method is Jacobi, where we solve du(x, y) and dv(x,y) using the
 469 previous iteration's value for all other du/dv values. (That this converges
 470 at all it beyond this text to prove, but it does. Not that we bother
 471 iterating until it converges; a few iterations is good enough.)
 472 Gauss-Seidel iterations improve on this in that (surprisingly!) using this
 473 iteration's computed du/dv values if they're ready; this improves convergence,
 474 but is hard to parallelize.
 475
 476 Successive over-relaxation (SOR) improves further on this, in that it
 477 assumes that the solution moves towards the right value, so why not
 478 just go a bit further? That is, if Gauss-Seidel would tell you to increase
 479 the flow by 1.0 pixel to the right, perhaps go 1.5 pixels to the right
 480 instead (this value is called ω). Again, the convergence proof is beyond the
 481 scope here, but SOR converges for any ω between 1 and 2 (1 gives plain
 482 Gauss-Seidel, and over 2, we risk overshooting and never converging). Optimal
 483 ω depends on the equation system; DIS uses ω = 1.6, which presumably was
 484 measured, while we do ω = 1.8 (seems to be marginally better after some
 485 light testing).
 486
 487 Efficient GPU implementation of SOR is not trivial; like noted before,
 488 Gauss-Seidel is inherently serial, which is a poor match for the GPU.
 489 Worse, doing SOR with Jacobi as base instead of Gauss-Seidel makes for
 490 an algorithm which simply does not converge. We solve this by using a
 491 method called red-black SOR (not to be confused with red-black binary
 492 trees). Conceptually, it assigns every unknown a color, with every other
 493 being red or black (similar to a checkerboard). Since red values now
 494 only depend on black values and vice versa, one can do all red values
 495 in parallel, then all black values, and so on. (This is equivalent to
 496 reordering the equation set; different such orderings can have different
 497 convergence speeds.)
 498
 499 Our GPU SOR implementation is not overly efficient, so essentially one such
 500 half-iteration of red-black SOR costs the same as one full iteration of
 501 Jacobi but convergence is so much faster that it's worth it. Generally
 502 speaking, Gauss-Seidel converges twice as fast as Jacobi (ie., if Jacobi
 503 converges in N iterations, Gauss-Seidel does so in N/2), but SOR converges
 504 _geometrically_ faster, ie., in O(√N) iterations.
 505
 506 Do note that the DeepFlow code does not fully use SOR or even Gauss-Seidel;
 507 it solves every 2x2 block (ie., single du/dv pair) using Cramer's rule,
 508 and then pushes that vector 60% further, SOR-style. This would be clearly
 509 more accurate if we didn't have SOR in the mix (since du and dv would
 510 converge immediately relative to each other, bar Cramer's numerical issues),
 511 but I'm not sure whether it's better given SOR. (DIS changes this to a more
 512 traditional SOR formulation, which we also use. It doesn't seem to be much
 513 different in practical testing; perhaps minutely worse, but I haven't done
 514 a deep analysis here.)
 515
 516 And that's it. References:
 517
 518 [Brox04]: Brox, Bruhn, Papenberg, Weickert: “High Accuracy Optical Flow
 519   Estimation Based on a Theory for Warping”, in Proceedings of the European
 520   Conference on Computer Vision (ECCV), 2004
 521 [Brox05]: Brox: “From Pixels to Regions: Partial Differential Equations in
 522   Image Analysis”, PhD thesis, 2005
 523 [Brox11]: Brox, Malik: “Large Displacement Optical Flow: Descriptor Matching in
 524   Variational Motion Estimation”, IEEE Transactions on Pattern Analysis and
 525   Machine Intelligence, 2011
 526 [Ciarlet78]: Ciarlet: “The Finite Element Method for Elliptic Problems”, 1978
 527 [Fahad07]: Fahad, Morris: “Multiple Combined Constraints for Optical Flow
 528   Estimation”, in Proceedings of the 3rd International Conference on Advances
 529   in Visual Computing (ISVC), 2007
 530 [Kroeger16]: Kroeger, Timofte, Dai, van Gool: “Fast Optical Flow using Dense
 531   Inverse Search”, in Proceedings of the European Conference on Computer Vision
 532   (ECCV), 2016
 533 [Weinzaepfel13]: Weinzaepfel, Revaud, Harchaoui, Schmid: “DeepFlow: Large
 534   displacement optical flow with deep matching”, in IEEE International Conference
 535   on Computer Vision (ICCV), 2013
 536 [Zimmer11]: Zimmer, Bruhn, Weickert: “Optic Flow in Harmony”, International
 537   Journal of Computer Vision, 2011