Dog Town by crowsonkb in deepdream

[–]crowsonkb[S] 1 point (0 children)

It's from when I was playing around with an interactive version of deepdream + style transfer with some seriously strange settings which I don't recall exactly. I found a few inputs that could make VGG19 generate 'scenes' instead of stylized weird images.

Dog Town by crowsonkb in deepdream

[–]crowsonkb[S] 1 point (0 children)

Content image: https://raw.githubusercontent.com/jcjohnson/fast-neural-style/master/images/content/chicago.jpg (see how the towers are in the same places in this Chicago skyline photo and Dog Town, there's a waterfront/street thing in both, etc...)

Style image: https://github.com/jcjohnson/neural-style/blob/master/examples/inputs/escher_sphere.jpg

[Project] Neural style transfer web application by crowsonkb in MachineLearning

[–]crowsonkb[S] 4 points (0 children)

This is version 2, the interactive one (and the one where I got L-BFGS to work). It's not tiled yet and is limited to 1400 px on the Amazon cloud machines it's running on.

Eyes by crowsonkb in deepdream

[–]crowsonkb[S] 1 point (0 children)

This output is from version 2, which I'm working on, but I think it could be reproduced on v1 by starting at a small size (I think this one was started around 200-220px) and specifying rising tiers of sizes, e.g. -s 225 320 450 640 900. Also, I used the equivalent of -tp 1 for edge-preserving noise removal.

Abstract city (So this is what happens when you combine Neural Style and Deep Dream...) by crowsonkb in deepdream

[–]crowsonkb[S] 5 points (0 children)

The content image was this photograph of the Chicago skyline that's an example input in jcjohnson/fast-neural-style. The style was this Picasso painting, similarly an example style. Though I did this with my implementation of Neural Style, not his.

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 2 points (0 children)

That's part of what this post is about. :) L-BFGS, and the method I describe in it, are second-order methods: they estimate the Hessian of the cost function in order to improve on gradient descent. Specifically, they are quasi-Newton methods based on a multidimensional analog of the secant method. These secant methods don't actually compute the Hessian; they estimate it from the past history of gradients and steps. That's essential, because the Hessian is an n×n matrix (n = number of parameters, so millions...), far too large to compute or store explicitly.
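Since the two-loop recursion comes up repeatedly below, here is a minimal sketch of standard L-BFGS (a textbook version, not the exact code from my project): it approximates the inverse-Hessian-vector product from a stored history of (s, y) pairs, never forming the n×n matrix.

```python
import numpy as np

def inv_hv(grad, s_hist, y_hist):
    """L-BFGS two-loop recursion: approximate H^{-1} @ grad from the
    stored history of steps (s) and gradient differences (y)."""
    rhos = [1.0 / np.dot(s, y) for s, y in zip(s_hist, y_hist)]
    q = grad.copy()
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial Hessian guess: scaled identity from the most recent pair.
    gamma = np.dot(s_hist[-1], y_hist[-1]) / np.dot(y_hist[-1], y_hist[-1])
    r = gamma * q
    # Second loop: oldest pair to newest.
    for (s, y, rho), alpha in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# On a quadratic with Hessian A = diag(2, 0.5) and exact secant pairs
# (y = A @ s), the recursion recovers the Newton step A^{-1} @ grad.
A = np.diag([2.0, 0.5])
s_hist = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_hist = [A @ s for s in s_hist]
grad = np.array([2.0, 1.0])
print(inv_hv(grad, s_hist, y_hist))  # → [1. 2.], i.e. A^{-1} @ grad
```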

Everything widely used in practice right now for neural nets is first-order, but second-order optimization is an ongoing area of research. The main problem has been adapting methods designed for convex, smooth optimization to the nonconvex, potentially nonsmooth, stochastic cost functions used in neural networks. Early attempts didn't work at all; in the last few years there have been methods that work, but not significantly better than SGD or Adam.

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 2 points (0 children)

It does work out that there's a sqrt(g2) * phi * s term in there, but I don't see how the formula as a whole relates to SQN's formula for y. The damping formula y_damped = (1 - phi) * y + phi * s estimates the inverse Hessian as a linear combination of the ordinary L-BFGS inverse-Hessian estimate and the identity matrix. Since you can consider ordinary gradient descent to use the identity matrix as its Hessian estimate, the effect is to pull the L-BFGS steps back slightly toward gradient descent steps. It shouldn't be confused with Powell damping, which is y_damped = phi * y + (1 - phi) * H * s, where H * s is computed with another two-loop recursion and phi is determined dynamically each step to ensure H stays positive definite.

What I don't know yet is what exactly the multiplication of y by a diagonal scaling matrix does to the Hessian estimate.
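As a concrete check of the pull-toward-gradient-descent claim (a toy, not my actual code): with a single (s, y) pair, setting phi = 1 damps y all the way to s, and the two-loop recursion then returns the bare gradient, i.e. a pure gradient descent direction.

```python
import numpy as np

def inv_hv_one_pair(grad, s, y):
    """Two-loop recursion with a history of a single (s, y) pair."""
    rho = 1.0 / np.dot(s, y)
    alpha = rho * np.dot(s, grad)
    q = grad - alpha * y
    gamma = np.dot(s, y) / np.dot(y, y)  # initial guess H0 = gamma * I
    r = gamma * q
    beta = rho * np.dot(y, r)
    return r + (alpha - beta) * s

s = np.array([0.3, -0.1])
y = np.array([0.5, 0.2])
grad = np.array([1.0, 2.0])

for phi in (0.0, 0.5, 1.0):
    y_damped = (1 - phi) * y + phi * s
    print(phi, inv_hv_one_pair(grad, s, y_damped))
# At phi = 1, y_damped == s, the secant condition forces the estimate to
# the identity, and the direction is exactly grad: damping pulls L-BFGS
# steps back toward gradient descent steps.
```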

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 3 points (0 children)

Oh right! I forgot you could do that regardless of whether your default Anaconda environment is Python 2 or 3. Thanks!

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 3 points (0 children)

Oh! I remember why I was trying that weird thing with y and the scaling matrix in the first place. I started out using the Adagrad scaling matrix as the initial Hessian diagonal, like adaQN, but I found that for image synthesis the non-constant diagonal entries would become visible in the image as a noise pattern. I don't know why it does that since Adagrad itself (and Adam etc.) don't do that. So I found this alternate way of incorporating the scaling matrix into the inverse-Hessian estimate which didn't introduce image noise.

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 6 points (0 children)

The part that incorporates momentum:

    self.g1 *= self.b1
    s = -self.step_size * self.inv_hv(self.g1 + self.grad)

Without momentum, the call to self.inv_hv would be just self.inv_hv(self.grad).

self.g1 is my gradient first moment accumulator and self.b1 (from Adam's beta_1) is my momentum parameter (here 0.75 for image synthesis but for neural network training it was 0.9). self.inv_hv() is the two-loop recursion function. This is Nesterov momentum so the search direction is b1 * g1 + self.grad (previous step's gradient) instead of just g1. Nesterov momentum did better in practice on the image synthesis problem than classical momentum. Importantly, I still form y using the actual gradient, not g1. The scaling matrix is another thing altogether from momentum and is applied completely separately, later.

I applied the scaling matrix in a really weird way that I haven't heard of anyone else doing. First I form y as grad - self.grad (the gradient at the current params minus the previous gradient), apply damping, then multiply y by sqrt(self.g2). While trying out different damping methods, I noticed that reducing the scale of y (leaving s alone) increased the effective learning rate, and vice versa. Further, this effect was per-parameter: I could pick out particular image regions, multiply the corresponding region of y by a constant, and slow the learning rate down for just those regions. So to apply the Adagrad scaling matrix, I divide y by it (equivalently, multiply y by sqrt(self.g2), since the scaling matrix is diag(1/sqrt(g2))). :) I found this to produce more stable behavior than using the scaling matrix as the initial Hessian diagonal in the two-loop recursion! I'm not using any kind of step rejection criterion currently, so it really has to never take unrecoverably bad steps, even if that means going a bit slower.

That is, y is formed as follows:

    y = grad - self.grad
    y = (1 - phi) * y + phi * s
    y *= sqrt(self.g2)

phi is a constant, and 0.2 is about right for it. Everyone else I've seen apply damping, except oLBFGS, uses some sort of rule to dynamically choose phi based on s and y. But that was complicated, and when I tried Powell damping proper it was less stable (though faster) than just picking a constant phi. Notably, oLBFGS uses a constant as well (they call it lambda).

self.g2 of course is Adagrad's gradient second moment accumulator. Have you heard at all of this sort of modification of y with a scaling matrix?
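I haven't seen it elsewhere either, but the per-parameter learning rate effect is easy to verify numerically. In this toy setup (assumed for illustration, not the actual style_transfer code) the (s, y) pairs are axis-aligned, so the inverse-Hessian estimate is diagonal, and scaling one coordinate of y by 4 divides the step in that coordinate by 4:

```python
import numpy as np

def inv_hv(grad, pairs):
    """L-BFGS two-loop recursion over a list of (s, y) pairs."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):
        alpha = np.dot(s, q) / np.dot(s, y)
        alphas.append(alpha)
        q = q - alpha * y
    s_last, y_last = pairs[-1]
    r = np.dot(s_last, y_last) / np.dot(y_last, y_last) * q
    for (s, y), alpha in zip(pairs, reversed(alphas)):
        beta = np.dot(y, r) / np.dot(s, y)
        r = r + (alpha - beta) * s
    return r

grad = np.array([1.0, 1.0])
e1, e2 = np.eye(2)
base = inv_hv(grad, [(e1, e1), (e2, e2)])          # H = I: step [1, 1]
scaled = inv_hv(grad, [(e1, e1), (e2, 4.0 * e2)])  # y2 *= 4: step [1, 0.25]
print(base, scaled)  # → [1. 1.] [1.   0.25]
```

Scaling the second coordinate of y up slows that coordinate's step down by the same factor, leaving the other coordinate untouched.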

[Project] I accidentally wrote a quasi-Newton (L-BFGS based) optimizer that could train neural networks using normal-sized minibatches. by crowsonkb in MachineLearning

[–]crowsonkb[S] 3 points (0 children)

Python 3 is not yet supported with the post-processing part of NumBBO/CoCO!

I keep forgetting Python 2 exists. ;_; Time to install Anaconda 2 I guess.

Abstract by crowsonkb in deepdream

[–]crowsonkb[S] 1 point (0 children)

    style_transfer.py f91f49e187911e065de6b718b519ddfc.jpg f91f49e187911e065de6b718b519ddfc.jpg --content-layers conv4_1 conv4_2 conv4_3 -cw -0.05 -i 200 50 -s 64 96 128 192 256 384 512 768 1024 1536 --tile-size 1536 --model vgg19_partavgpool.prototxt

Content and style image: this fanart of Madeleine L'Engle's character Proginoskes.

This effect was created by (a) a negative content weight, (b) rendering at a large number of scales starting at 64px, and (c) using average pooling for all pooling layers except the first one, which stayed max pooling.
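The multi-scale part, (b), is basically a coarse-to-fine loop. Here's a toy numpy sketch (assuming -i 200 50 means 200 iterations at the first scale and 50 at each later one; optimize() is just a stand-in for the actual style transfer iterations and returns its input unchanged):

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbor resize of a square grayscale image (a toy
    stand-in for proper bicubic resampling)."""
    idx = (np.arange(size) * img.shape[0] / size).astype(int)
    return img[np.ix_(idx, idx)]

def optimize(img, iters):
    """Stand-in for the actual L-BFGS style transfer iterations."""
    return img

scales = [64, 96, 128, 192, 256]
img = np.random.default_rng(0).random((scales[0], scales[0]))
for i, size in enumerate(scales):
    img = resize_nn(img, size)                  # upsample the previous result...
    img = optimize(img, 200 if i == 0 else 50)  # ...and refine at this scale
print(img.shape)  # → (256, 256)
```

Each scale's output seeds the next, which is what lets structure found at 64px survive all the way up to the final size.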