Theory of Computing Blog Aggregator

“Information is physical.”

This slogan seems to have originated around 1991 with Rolf Landauer.  It’s ricocheted around quantum information for the entire time I’ve been in the field, incanted in funding agency reports and popular articles and at the beginnings and ends of talks.

But what the hell does it mean?

There are many things it’s taken to mean, in my experience, that don’t make a lot of sense when you think about them—or else they’re vacuously true, or purely a matter of perspective, or not faithful readings of the slogan’s words.

For example, some people seem to use the slogan to mean something more like its converse: “physics is informational.”  That is, the laws of physics are ultimately not about mass or energy or pressure, but about bits and computations on them.  As I’ve often said, my problem with that view is less its audacity than its timidity!  It’s like, what would the universe have to do in order not to be informational in this sense?  “Information” is just a name we give to whatever picks out one element from a set of possibilities, with the “amount” of information given by the log of the set’s cardinality (and with suitable generalizations to infinite sets, nonuniform probability distributions, yadda yadda).  So, as long as the laws of physics take the form of telling us that some observations or configurations of the world are possible and others are not, or of giving us probabilities for each configuration, no duh they’re about information!
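
To make the parenthetical concrete (this is just the standard Shannon setup, nothing specific to the slogan): a uniform choice from a finite set $S$ carries

$\quad\quad I = \log_2 |S| \ \text{bits},$

and the generalization to a nonuniform distribution $p$ is the Shannon entropy

$\quad\quad H(p) = -\sum_i p_i \log_2 p_i ,$

which reduces to $\log_2 |S|$ when $p$ is uniform on $S$.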

Other people use “information is physical” to pour scorn on the idea that “information” could mean anything without some actual physical instantiation of the abstract 0’s and 1’s, such as voltage differences in a loop of wire.  Here I certainly agree with the tautology that in order to exist physically—that is, be embodied in the physical world—a piece of information (like a song, video, or computer program) does need to be embodied in the physical world.  But my inner Platonist slumps in his armchair when people go on to assert that, for example, it’s meaningless to discuss the first prime number larger than 10^(10^125), because according to post-1998 cosmology, one couldn’t fit its digits inside the observable universe.

If the cosmologists revise their models next week, will this prime suddenly burst into existence, with all the mathematical properties that one could’ve predicted for it on general grounds—only to fade back into the netherworld if the cosmologists revise their models again?  Why would anyone want to use language in such a tortured way?

Yes, brains, computers, yellow books, and so on that encode mathematical knowledge comprise only a tiny sliver of the physical world.  But it’s equally true that the physical world we observe comprises only a tiny sliver of mathematical possibility-space.

Still other people use “information is physical” simply to express their enthusiasm for the modern merger of physical and information sciences, as exemplified by quantum computing.  Far be it from me to temper that enthusiasm: rock on, dudes!

Yet others use “information is physical” to mean that the rules governing information processing and transmission in the physical world aren’t knowable a priori, but can only be learned from physics.  This is clearest in the case of quantum information, which has its own internal logic that generalizes the logic of classical information.  But in some sense, we didn’t need quantum mechanics to tell us this!  Of course the laws of physics have ultimate jurisdiction over whatever occurs in the physical world, information processing included.

My biggest beef, with all these unpackings of the “information is physical” slogan, is that none of them really engage with any of the deep truths that we’ve learned about physics.  That is, we could’ve had more-or-less the same debates about any of them, even in a hypothetical world where the laws of physics were completely different.


So then what should we mean by “information is physical”?  In the rest of this post, I’d like to propose an answer to that question.

We get closer to the meat of the slogan if we consider some actual physical phenomena, say in quantum mechanics.  The double-slit experiment will do fine.

Recall: you shoot photons, one by one, at a screen with two slits, then examine the probability distribution over where the photons end up on a second screen.  You ask: does that distribution contain alternating “light” and “dark” regions, the signature of interference between positive and negative amplitudes?  And the answer, predicted by the math and confirmed by experiment, is: yes, but only if the information about which slit the photon went through failed to get recorded anywhere else in the universe, other than the photon location itself.

Here a skeptic interjects: but that has to be wrong!  The criterion for where a physical particle lands on a physical screen can’t possibly depend on anything as airy as whether “information” got “recorded” or not.  For what counts as “information,” anyway?  As an extreme example: what if God, unbeknownst to us mortals, took divine note of which slit the photon went through?  Would that destroy the interference pattern?  If so, then every time we do the experiment, are we collecting data about the existence or nonexistence of an all-knowing God?

It seems to me that the answer is: insofar as the mind of God can be modeled as a tensor factor in Hilbert space, yes, we are.  And crucially, if quantum mechanics is universally true, then the mind of God would have to be such a tensor factor, in order for its state to play any role in the prediction of observed phenomena.
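
To spell out the “tensor factor” point with a toy calculation (the standard two-level decoherence example, not anything specific to God or to this post): suppose the which-slit degree of freedom starts in the superposition $(|L\rangle + |R\rangle)/\sqrt{2}$, and some other system (a detector, a stray air molecule, a divine mind) starts in a state $|E\rangle$ and records the slit:

$\quad\quad \dfrac{|L\rangle + |R\rangle}{\sqrt{2}} \otimes |E\rangle \ \longrightarrow\ \dfrac{|L\rangle|E_L\rangle + |R\rangle|E_R\rangle}{\sqrt{2}} .$

Tracing out the recording system (with $|E_L\rangle, |E_R\rangle$ normalized) leaves the photon in the state

$\quad\quad \rho = \tfrac{1}{2}\left(|L\rangle\langle L| + |R\rangle\langle R| + \langle E_R|E_L\rangle\, |L\rangle\langle R| + \langle E_L|E_R\rangle\, |R\rangle\langle L|\right),$

so the interference terms get multiplied by $\langle E_R|E_L\rangle$: they vanish exactly when the record perfectly distinguishes the slits, and survive exactly when no distinguishing record exists anywhere.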

To say this another way: it’s obvious and unexceptionable that, by observing a physical system, you can often learn something about what information must be in it.  For example, you need never have heard of DNA to deduce that chickens must somehow contain information about making more chickens.  What’s much more surprising is that, in quantum mechanics, you can often deduce things about what information can’t be present, anywhere in the physical world—because if such information existed, even a billion light-years away, it would necessarily have a physical effect that you don’t see.

Another famous example here concerns identical particles.  You may have heard the slogan that “if you’ve seen one electron, you’ve seen them all”: that is, apart from position, momentum, and spin, every two electrons have exactly the same mass, same charge, same every other property, including even any properties yet to be discovered.  Again the skeptic interjects: but that has to be wrong.  Logically, you could only ever confirm that two electrons were different, by observing a difference in their behavior.  Even if the electrons had behaved identically for a billion years, you couldn’t rule out the possibility that they were actually different, for example because of tiny nametags (“Hi, I’m Emily the Electron!” “Hi, I’m Ernie!”) that had no effect on any experiment you’d thought to perform, but were visible to God.

You can probably guess where this is going.  Quantum mechanics says that, no, you can verify that two particles are perfectly identical by doing an experiment where you swap them and see what happens.  If the particles are identical in all respects, then you’ll see quantum interference between the swapped and un-swapped states.  If they aren’t, you won’t.  The kind of interference you’ll see is different for fermions (like electrons) than for bosons (like photons), but the basic principle is the same in both cases.  Once again, quantum mechanics lets you verify that a specific type of information—in this case, information that distinguishes one particle from another—was not present anywhere in the physical world, because if it were, it would’ve destroyed an interference effect that you in fact saw.
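
To see the swap experiment in symbols (the standard textbook treatment of identical particles, not a new claim): with one particle in mode $a$ and one in mode $b$, the physical two-particle state is the symmetrized or antisymmetrized combination

$\quad\quad |\psi_{\pm}\rangle = \dfrac{|a\rangle_1 |b\rangle_2 \pm |b\rangle_1 |a\rangle_2}{\sqrt{2}},$

with $+$ for bosons and $-$ for fermions, and the swapped and un-swapped terms interfere.  If the particles secretly carried orthogonal internal “nametag” states ($|\mathrm{Emily}\rangle$ and $|\mathrm{Ernie}\rangle$), then the swapped and un-swapped configurations would be orthogonal states of the world, their cross terms in any interference experiment would vanish, and the characteristic bosonic or fermionic exchange interference would disappear, which is exactly what we don’t observe.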

This, I think, already provides a meatier sense in which “information is physical” than any of the senses discussed previously.


But we haven’t gotten to the filet mignon yet.  The late, great Jacob Bekenstein will forever be associated with the discovery that information, wherever and whenever it occurs in the physical world, takes up a minimum amount of space.  The most precise form of this statement, called the covariant entropy bound, was worked out in detail by Raphael Bousso.  Here I’ll be discussing a looser version of the bound, which holds in “non-pathological” cases, and which states that a bounded physical system can store at most A/(4 ln 2) bits of information, where A is the area in Planck units of any surface that encloses the system—so, about 10^69 bits per square meter.  (Actually it’s 10^69 qubits per square meter, but because of Holevo’s theorem, an upper bound on the number of qubits is also an upper bound on the number of classical bits that can be reliably stored in a system and then retrieved later.)
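
Where the 10^69 comes from (just plugging the standard value of the Planck length into the bound; no new physics here):

$\quad\quad \ell_P = \sqrt{\hbar G / c^3} \approx 1.6 \times 10^{-35}\ \mathrm{m}, \qquad \dfrac{1\ \mathrm{m}^2}{4 \ln 2 \cdot \ell_P^{2}} \approx \dfrac{1}{(2.8)(2.6\times 10^{-70})} \approx 1.4 \times 10^{69}\ \text{bits}.$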

You might have heard of the famous way Nature enforces this bound.  Namely, if you tried to create a hard drive that stored more than 10^69 bits per square meter of surface area, the hard drive would necessarily collapse to a black hole.  And from that point on, the information storage capacity would scale “only” with the area of the black hole’s event horizon—a black hole itself being the densest possible hard drive allowed by physics.

Let’s hear once more from our skeptic.  “Nonsense!  Matter can take up space.  Energy can take up space.  But information?  Bah!  That’s just a category mistake.  For a proof, suppose God took one of your black holes, with a 1-square-meter event horizon, which already had its supposed maximum of ~10^69 bits of information.  And suppose She then created a bunch of new fundamental fields, which didn’t interact with gravity, electromagnetism, or any of the other fields that we know from observation, but which had the effect of encoding 10^300 new bits in the region of the black hole.  Presto!  An unlimited amount of additional information, exactly where Bekenstein said it couldn’t exist.”

We’d like to pinpoint what’s wrong with the skeptic’s argument—and do so in a self-contained, non-question-begging way, a way that doesn’t pull any rabbits out of hats, other than the general principles of relativity and quantum mechanics.  I was confused myself about how to do this, until a month ago, when Daniel Harlow helped set me straight (any remaining howlers in my exposition are 100% mine, not his).

I believe the logic goes like this:

  1. Relativity—even just Galilean relativity—demands that, in flat space, the laws of physics must have the same form for all inertial observers (i.e., all observers who move through space at constant speed).
  2. Anything in the physical world that varies in space—say, a field that encodes different bits of information at different locations—also varies in time, from the perspective of an observer who moves through the field at a constant speed.
  3. Combining 1 and 2, we conclude that anything that can vary in space can also vary in time.  Or to say it better, there’s only one kind of varying: varying in spacetime.
  4. More strongly, special relativity tells us that there’s a specific numerical conversion factor between units of space and units of time: namely the speed of light, c.  Loosely speaking, this means that if we know the rate at which a field varies across space, we can also calculate the rate at which it varies across time, and vice versa.
  5. Anything that varies across time carries energy.  Why?  Because this is essentially the definition of energy in quantum mechanics!  Up to a constant multiple (namely, Planck’s constant), energy is the expected speed of rotation of the global phase of the wavefunction, when you apply your Hamiltonian.  If the global phase rotates at the slowest possible speed, then we take the energy to be zero, and say you’re in a vacuum state.  If it rotates at the next highest speed, we say you’re in a first excited state, and so on.  Indeed, assuming a time-independent Hamiltonian, the evolution of any quantum system can be fully described by simply decomposing the wavefunction into a superposition of energy eigenstates, then tracking the phase of each eigenstate’s amplitude as it loops around and around the unit circle.  No energy means no looping around means nothing ever changes.  (For a one-line version in symbols, see the sketch just after this list.)
  6. Combining 3 and 5, any field that varies across space carries energy.
  7. More strongly, combining 4 and 5, if we know how quickly a field varies across space, we can lower-bound how much energy it has to contain.
  8. In general relativity, anything that carries energy couples to the gravitational field.  This means that anything that carries energy necessarily has an observable effect: if nothing else, its effect on the warping of spacetime.  (This is dramatically illustrated by dark matter, which is currently observable via its spacetime warping effect and nothing else.)
  9. Combining 6 and 8, any field that varies across space couples to the gravitational field.
  10. More strongly, combining 7 and 8, if we know how quickly a field varies across space, then we can lower-bound by how much it has to warp spacetime.  This is so because of another famous (and distinctive) feature of gravity: namely, the fact that it’s universally attractive, so all the warping contributions add up.
  11. But in GR, spacetime can only be warped by so much before we create a black hole: this is the famous Schwarzschild bound.
  12. Combining 10 and 11, the information contained in a physical field can only vary so quickly across space, before it causes spacetime to collapse to a black hole.
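
A one-line version of step 5, in standard notation (nothing here beyond the textbook Schrödinger equation): writing the state in the energy eigenbasis of a time-independent Hamiltonian $H$,

$\quad\quad H|n\rangle = E_n |n\rangle, \qquad |\psi(t)\rangle = \sum_n c_n\, e^{-i E_n t/\hbar}\, |n\rangle ,$

each amplitude’s phase loops around the unit circle at angular speed $E_n/\hbar$.  Energy and “rate of change in time” are the same thing up to the factor $\hbar$; a state supported entirely on $E_n = 0$ literally never changes.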

Summarizing where we’ve gotten, we could say: any information that’s spatially localized at all, can only be localized so precisely.  In our world, the more densely you try to pack 1’s and 0’s, the more energy you need, therefore the more you warp spacetime, until all you’ve gotten for your trouble is a black hole.  Furthermore, if we rewrote the above conceptual argument in math—keeping track of all the G’s, c’s, h’s, and so on—we could derive a quantitative bound on how much information there can be in a bounded region of space.  And if we were careful enough, that bound would be precisely the holographic entropy bound, which says that the number of (qu)bits is at most A/(4 ln 2), where A is the area of a bounding surface in Planck units.
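
Here’s roughly how the back-of-the-envelope version goes (a heuristic with all factors of order one dropped; the honest derivation is Bousso’s, not this).  By steps 4–7, a bit localized within a region of radius R corresponds to a field mode of wavelength at most about R, hence frequency at least about c/R, hence energy at least about ħc/R; so N bits inside radius R cost E ≳ Nħc/R.  By step 11, avoiding a black hole requires R ≳ GE/c^4.  Combining:

$\quad\quad N\,\dfrac{\hbar c}{R} \ \lesssim\ E \ \lesssim\ \dfrac{c^4 R}{G} \quad\Longrightarrow\quad N \ \lesssim\ \dfrac{c^3 R^2}{\hbar G} \ =\ \dfrac{R^2}{\ell_P^{2}} \ \sim\ \dfrac{A}{\ell_P^{2}},$

which is the holographic scaling; doing the bookkeeping carefully turns the $\sim$ into the 1/(4 ln 2).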

Let’s pause to point out some interesting features of this argument.

Firstly, we pretty much needed the whole kitchen sink of basic physical principles: special relativity (both the equivalence of inertial frames and the finiteness of the speed of light), quantum mechanics (in the form of the universal relation between energy and frequency), and finally general relativity and gravity.  All three of the fundamental constants G, c, and h made appearances, which is why all three show up in the detailed statement of the holographic bound.

But secondly, gravity only appeared from step 8 onwards.  Up till then, everything could be said solely in the language of quantum field theory: that is, quantum mechanics plus special relativity.  The result would be the so-called Bekenstein bound, which upper-bounds the number of bits in any spatial region by the product of the region’s radius and its energy content.  I learned that there’s an interesting history here: Bekenstein originally deduced this bound using ingenious thought experiments involving black holes.  Only later did people realize that the Bekenstein bound can be derived purely within QFT (see here and here for example)—in contrast to the holographic bound, which really is a statement about quantum gravity.  (An early hint of this was that, while the holographic bound involves Newton’s gravitational constant G, the Bekenstein bound doesn’t.)
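
For reference, the bound in question, in its usual form (the constant is Bekenstein’s; dividing by ln 2 converts from nats to bits):

$\quad\quad S \ \le\ \dfrac{2\pi k\, R\, E}{\hbar c}, \qquad\text{equivalently}\qquad \#\text{bits} \ \le\ \dfrac{2\pi R E}{\hbar c \ln 2},$

where R is the radius of a sphere enclosing the system and E is its total energy.  As promised, no G in sight.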

Thirdly, speaking of QFT, some readers might be struck by the fact that at no point in our 12-step program did we ever seem to need QFT machinery.  Which is fortunate, because if we had needed it, I wouldn’t have been able to explain any of this!  But here I have to confess that I cheated slightly.  Recall step 4, which said that “if you know the rate at which a field varies across space, you can calculate the rate at which it varies across time.”  It turns out that, in order to give that sentence a definite meaning, one uses the fact that in QFT, space and time derivatives in the Hamiltonian need to be related by a factor of c, since otherwise the Hamiltonian wouldn’t be Lorentz-invariant.

Fourthly, eagle-eyed readers might notice a loophole in the argument.  Namely, we never upper-bounded how much information God could add to the world, via fields that are constant across all of spacetime.  For example, there’s nothing to stop Her from creating a new scalar field that takes the same value everywhere in the universe—with that value, in suitable units, encoding 10^50000 separate divine thoughts in its binary expansion.  But OK, being constant, such a field would interact with nothing and affect no observations—so Occam’s Razor itches to slice it off, by rewriting the laws of physics in a simpler form where that field is absent.  If you like, such a field would at most be a comment in the source code of the universe: it could be as long as the Great Programmer wanted it to be, but would have no observable effect on those of us living inside the program’s execution.


Of course, even before relativity and quantum mechanics, information had already been playing a surprisingly fleshy role in physics, through its appearance as entropy in 19th-century thermodynamics.  Which leads to another puzzle.  To a computer scientist, the concept of entropy, as the log of the number of microstates compatible with a given macrostate, seems clear enough, as does the intuition for why it should increase monotonically with time.  Or at least, to whatever extent we’re confused about these matters, we’re no more confused than the physicists are!

But then why should this information-theoretic concept be so closely connected to tangible quantities like temperature, and pressure, and energy?  From the mere assumption that a black hole has a nonzero entropy—that is, that it takes many bits to describe—how could Bekenstein and Hawking have possibly deduced that it also has a nonzero temperature?  Or: if you put your finger into a tub of hot water, does the heat that you feel somehow reflect how many bits are needed to describe the water’s microstate?

Once again our skeptic pipes up: “but surely God could stuff as many additional bits as She wanted into the microstate of the hot water—for example, in degrees of freedom that are still unknown to physics—without the new bits having any effect on the water’s temperature.”

But we should’ve learned by now to doubt this sort of argument.  There’s no general principle, in our universe, saying that you can hide as many bits as you want in a physical object, without those bits influencing the object’s observable properties.  On the contrary, in case after case, our laws of physics seem to be intolerant of “wallflower bits,” which hide in a corner without talking to anyone.  If a bit is there, the laws of physics want it to affect other nearby bits and be affected by them in turn.

In the case of thermodynamics, the assumption that does all the real work here is that of equidistribution.  That is, whatever degrees of freedom might be available to your thermal system, your gas in a box or whatever, we assume that they’re all already “as randomized as they could possibly be,” subject to a few observed properties like temperature and volume and pressure.  (At least, we assume that in classical thermodynamics.  Non-equilibrium thermodynamics is a whole different can of worms, worms that don’t stay in equilibrium.)  Crucially, we assume this despite the fact that we might not even know all the relevant degrees of freedom.

Why is this assumption justified?  “Because experiment bears it out,” the physics teacher explains—but we can do better.  The assumption is justified because, as long as the degrees of freedom that we’re talking about all interact with each other, they’ve already had plenty of time to equilibrate.  And conversely, if a degree of freedom doesn’t interact with the stuff we’re observing—or with anything that interacts with the stuff we’re observing, etc.—well then, who cares about it anyway?

But now, because the microscopic laws of physics have the fundamental property of reversibility—that is, they never destroy information—a new bit has to go somewhere, and it can’t overwrite degrees of freedom that are already fully randomized.  So this is essentially why, if you pump more bits of information into a tub of hot water, while keeping it at the same volume, physicists are confident that the new bits will have nowhere to go except into pushing up the energy—which, in this case, means describing a greater range of possible speeds for the water molecules.  So then the molecules move faster on average, and your finger feels the water get hotter.
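
The thermodynamic identity doing the work in the last paragraph is just the definition of temperature (standard statistical mechanics, not anything exotic):

$\quad\quad \dfrac{1}{T} = \left(\dfrac{\partial S}{\partial E}\right)_{V} \quad\Longrightarrow\quad \Delta E \ \approx\ T\,\Delta S \ =\ (k T \ln 2) \times (\text{number of new bits}),$

so each extra bit you cram into the water’s microstate at fixed volume costs about $kT\ln 2$ of energy, the same $kT\ln 2$ that shows up in Landauer’s principle.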


In summary, our laws of physics are structured in such a way that even pure information often has “nowhere to hide”: if the bits are there at all in the abstract machinery of the world, then they’re forced to pipe up and have a measurable effect.  And this is not a tautology, but comes about only because of nontrivial facts about special and general relativity, quantum mechanics, quantum field theory, and thermodynamics.  And this is what I think people should mean when they say “information is physical.”

Anyway, if this was all obvious to you, I apologize for having wasted your time!  But in my defense, it was never explained to me quite this way, nor was it sorted out in my head until recently—even though it seems like one of the most basic and general things one can possibly say about physics.


Endnotes. Thanks again to Daniel Harlow, not only for explaining the logic of the holographic bound to me but for several suggestions that improved this post.

Some readers might suspect circularity in the arguments we’ve made: are we merely saying that “any information that has observable physical consequences, has observable physical consequences”?  No, it’s more than that.  In all the examples I discussed, the magic was that we inserted certain information into our abstract mathematical description of the world, taking no care to ensure that the information’s presence would have any observable consequences whatsoever.  But then the principles of quantum mechanics, quantum gravity, or thermodynamics forced the information to be detectable in very specific ways (namely, via the destruction of quantum interference, the warping of spacetime, or the generation of heat respectively).

by Scott at July 21, 2017 01:30 AM UTC

Authors: Ruben Becker, Vincenzo Bonifaci, Andreas Karrenbauer, Pavel Kolev, Kurt Mehlhorn
Download: PDF
Abstract: In this paper, we present two results on slime mold computations. The first one treats a biologically-grounded model, originally proposed by biologists analyzing the behavior of the slime mold Physarum polycephalum. This primitive organism was empirically shown by Nakagaki et al. to solve shortest path problems in wet-lab experiments (Nature'00). We show that the proposed simple mathematical model actually generalizes to a much wider class of problems, namely undirected linear programs with a non-negative cost vector.

For our second result, we consider the discretization of a biologically-inspired model. This model is a directed variant of the biologically-grounded one and was never claimed to describe the behavior of a biological system. Straszak and Vishnoi showed that it can $\epsilon$-approximately solve flow problems (SODA'16) and even general linear programs with positive cost vector (ITCS'16) within a finite number of steps. We give a refined convergence analysis that improves the dependence on $\epsilon$ from polynomial to logarithmic and simultaneously allows us to choose a step size that is independent of $\epsilon$. Furthermore, we show that the dynamics can be initialized with a more general set of (infeasible) starting points.

at July 21, 2017 12:00 AM UTC

Authors: Eldar Fischer, Frédéric Magniez, Tatiana Starikovskaya
Download: PDF
Abstract: In this paper we consider the problem of deciding membership in Dyck languages, a fundamental family of context-free languages, comprised of well-balanced strings of parentheses. In this problem we are given a string of length $n$ in the alphabet of parentheses of $m$ types and must decide if it is well-balanced. We consider this problem in the property testing setting, where one would like to make the decision while querying as few characters of the input as possible.

Property testing of strings for Dyck language membership for $m=1$, with a number of queries independent of the input size $n$, was provided in [Alon, Krivelevich, Newman and Szegedy, SICOMP 2001]. Property testing of strings for Dyck language membership for $m \ge 2$ was first investigated in [Parnas, Ron and Rubinfeld, RSA 2003]. They showed an upper bound and a lower bound for distinguishing strings belonging to the language from strings that are far (in terms of the Hamming distance) from the language, which are respectively (up to polylogarithmic factors) the $2/3$ power and the $1/11$ power of the input size $n$.

Here we improve the power of $n$ in both bounds. For the upper bound, we introduce a recursion technique, that together with a refinement of the methods in the original work provides a test for any power of $n$ larger than $2/5$. For the lower bound, we introduce a new problem called Truestring Equivalence, which is easily reducible to the $2$-type Dyck language property testing problem. For this new problem, we show a lower bound of $n$ to the power of $1/5$.

at July 21, 2017 12:00 AM UTC

Authors: Bin Sheng, Ruijuan Li, Gregory Gutin
Download: PDF
Abstract: The famous Chinese Postman Problem (CPP) is polynomial time solvable on both undirected and directed graphs. Gutin et al. [Discrete Applied Math 217 (2016)] generalized these results by proving that CPP on $c$-edge-colored graphs is polynomial time solvable for every $c\geq 2$. In CPP on weighted edge-colored graphs $G$, we wish to find a minimum weight properly colored closed walk containing all edges of $G$ (a walk is properly colored if every two consecutive edges are of different color, including the last and first edges in a closed walk). In this paper, we consider CPP on arc-colored digraphs (for properly colored closed directed walks), and provide a polynomial-time algorithm for the problem on weighted 2-arc-colored digraphs. This is a somewhat surprising result since it is NP-complete to decide whether a 2-arc-colored digraph has a properly colored directed cycle [Gutin et al., Discrete Math 191 (1998)]. To obtain the polynomial-time algorithm, we characterize 2-arc-colored digraphs containing properly colored Euler trails.

at July 21, 2017 12:00 AM UTC

Authors: Rajesh Chitnis, Andreas Emil Feldmann, Pasin Manurangsi
Download: PDF
Abstract: (Shortened abstract below, see paper for full abstract)

Given an edge-weighted directed graph $G=(V,E)$ and a set of $k$ terminal pairs $\{(s_i, t_i)\}_{i=1}^{k}$, the objective of the Directed Steiner Network (DSN) problem is to compute the cheapest subgraph $N$ of $G$ such that there is an $s_i\to t_i$ path in $N$ for each $i\in [k]$. This problem is notoriously hard: there is no polytime $O(2^{\log^{1-\epsilon}n})$-approximation [Dodis & Khanna, STOC '99], and it is W[1]-hard [Guo et al., SIDMA '11] for the well-studied parameter $k$. One option to circumvent these hardness results is to obtain a parameterized approximation. Our first result is the following:

- Under Gap-ETH there is no $k^{o(1)}$-approximation for DSN in $f(k)\cdot n^{O(1)}$ time for any function $f$.

Given our results above, the question we explore in this paper is: can we obtain faster algorithms or better approximation ratios for other special cases of DSN when parameterizing by $k$?

A well-studied special case of DSN is the Strongly Connected Steiner Subgraph (SCSS) problem, where the goal is to find the cheapest subgraph of $G$ which pairwise connects all $k$ given terminals. We show the following:

- Under Gap-ETH no $(2-\epsilon)$-approximation can be computed in $f(k)\cdot n^{O(1)}$ time, for any function $f$. This shows that the $2$-approximation in $2^k\cdot\mathrm{poly}(n)$ time observed in [Chitnis et al., IPEC 2013] is optimal.

Next we consider DSN and SCSS when the input graph is bidirected, i.e., for every edge $uv$ the reverse edge exists and has the same weight. We call the respective problems bi-DSN and bi-SCSS and show:

- bi-SCSS is NP-hard, but there is a $2^{O(2^{k^2-k})}\cdot\mathrm{poly}(n)$ time algorithm.

- In contrast, bi-DSN is W[1]-hard parameterized by $k$.

at July 21, 2017 12:00 AM UTC

Authors: Ankush Agarwalla, John Augustine, William K. Moses Jr., Madhav Sankar K., Arvind Krishna Sridhar
Download: PDF
Abstract: In this work, we study the problem of dispersion of mobile robots on dynamic rings. The problem of dispersion of $n$ robots on an $n$ node graph, introduced by Augustine and Moses Jr.~\cite{AM17}, requires robots to coordinate with each other and reach a configuration where exactly one robot is present on each node. This problem has real world applications and applies whenever we want to minimize the total cost of $n$ agents sharing $n$ resources, located at various places, subject to the constraint that the cost of an agent moving to a different resource is comparatively much smaller than the cost of multiple agents sharing a resource (e.g. smart electric cars sharing recharge stations). The study of this problem also provides indirect benefits to the study of exploration by mobile robots and the study of load balancing on graphs.

We solve the problem of dispersion in the presence of two types of dynamism in the underlying graph: (i) vertex permutation and (ii) 1-interval connectivity. We introduce the notion of vertex permutation dynamism and have it mean that for a given set of nodes, in every round, the adversary ensures a ring structure is maintained, but the connections between the nodes may change. We use the idea of 1-interval connectivity from Di Luna et al.~\cite{LDFS16}, where for a given ring, in each round, the adversary chooses at most one edge to remove.

We assume robots have full visibility and present asymptotically time optimal algorithms to achieve dispersion in the presence of both types of dynamism when robots have chirality. When robots do not have chirality, we present asymptotically time optimal algorithms to achieve dispersion subject to certain constraints. Finally, we provide impossibility results for dispersion when robots have no visibility.

at July 21, 2017 12:00 AM UTC

Authors: Gonzalo Navarro
Download: PDF
Abstract: We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1,\sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ single-character or block edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n+s)$, precisely $O((n\log\sigma+s\log^2 N)\log D)$ bits, and with useful worst-case time guarantees: Given a pattern of length $m$, the index reports the $\mathrm{ndoc}$ strings where it appears in time $O(m^2 + m\log^{1+\epsilon} N \cdot \mathrm{ndoc})$, for any constant $\epsilon>0$. Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text $T[1,N]$ with a grammar of size $r$, uses $O(r\log N)$ bits and counts the number of occurrences of a pattern $P[1,m]$ in time $O(m^2 + m\log^{2+\epsilon} r)$, for any constant $\epsilon>0$.

at July 21, 2017 12:00 AM UTC

Authors: Nikhil Srivastava, Luca Trevisan
Download: PDF
Abstract: We prove the following Alon-Boppana type theorem for general (not necessarily regular) weighted graphs: if $G$ is an $n$-node weighted undirected graph of average combinatorial degree $d$ (that is, $G$ has $dn/2$ edges) and girth $g> 2d^{1/8}+1$, and if $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ are the eigenvalues of the (non-normalized) Laplacian of $G$, then \[ \frac {\lambda_n}{\lambda_2} \geq 1 + \frac 4{\sqrt d} - O \left( \frac 1{d^{\frac 58} }\right) \] (The Alon-Boppana theorem implies that if $G$ is unweighted and $d$-regular, then $\frac {\lambda_n}{\lambda_2} \geq 1 + \frac 4{\sqrt d} - O\left( \frac 1 d \right)$ if the diameter is at least $d^{1.5}$.)

Our result implies a lower bound for spectral sparsifiers. A graph $H$ is a spectral $\epsilon$-sparsifier of a graph $G$ if \[ L(G) \preceq L(H) \preceq (1+\epsilon) L(G) \] where $L(G)$ is the Laplacian matrix of $G$ and $L(H)$ is the Laplacian matrix of $H$. Batson, Spielman and Srivastava proved that for every $G$ there is an $\epsilon$-sparsifier $H$ of average degree $d$ where $\epsilon \approx \frac {4\sqrt 2}{\sqrt d}$ and the edges of $H$ are a (weighted) subset of the edges of $G$. Batson, Spielman and Srivastava also show that the bound on $\epsilon$ cannot be reduced below $\approx \frac 2{\sqrt d}$ when $G$ is a clique; our Alon-Boppana-type result implies that $\epsilon$ cannot be reduced below $\approx \frac 4{\sqrt d}$ when $G$ comes from a family of expanders of super-constant degree and super-constant girth.

The method of Batson, Spielman and Srivastava proves a more general result, about sparsifying sums of rank-one matrices, and their method applies to an "online" setting. We show that for the online matrix setting the $4\sqrt 2 / \sqrt d$ bound is tight, up to lower order terms.

at July 21, 2017 12:00 AM UTC

Authors: Erik D. Demaine, Quanquan C. Liu
Download: PDF
Abstract: Pebble games are single-player games on DAGs involving placing and moving pebbles on nodes of the graph according to a certain set of rules. The goal is to pebble a set of target nodes using a minimum number of pebbles. In this paper, we present a possibly simpler proof of the result in [CLNV15] and strengthen the result to show that it is PSPACE-hard to determine the minimum number of pebbles to an additive $n^{1/3-\epsilon}$ term for all $\epsilon > 0$, which improves upon the currently known additive constant hardness of approximation [CLNV15] in the standard pebble game. We also introduce a family of explicit, constant indegree graphs with $n$ nodes where there exists a graph in the family such that using constant $k$ pebbles requires $\Omega(n^k)$ moves to pebble in both the standard and black-white pebble games. This independently answers an open question summarized in [Nor15] of whether a family of DAGs exists that meets the upper bound of $O(n^k)$ moves using constant $k$ pebbles with a different construction than that presented in [AdRNV17].

at July 21, 2017 12:40 AM UTC

Authors: Jacob Holm, Eva Rotenberg, Mikkel Thorup
Download: PDF
Abstract: We present a deterministic fully-dynamic data structure for maintaining information about the bridges in a graph. We support updates in $\tilde{O}((\log n)^2)$ amortized time, and can find a bridge in the component of any given vertex, or a bridge separating any two given vertices, in $O(\log n / \log \log n)$ worst case time. Our bounds match the current best bounds for deterministic fully-dynamic connectivity up to $\log\log n$ factors. The previous best dynamic bridge finding was an $\tilde{O}((\log n)^3)$ amortized time algorithm by Thorup [STOC2000], which was a bittrick-based improvement on the $O((\log n)^4)$ amortized time algorithm by Holm et al. [STOC98, JACM2001].

Our approach is based on a different and purely combinatorial improvement of the algorithm of Holm et al., which by itself gives a new combinatorial $\tilde{O}((\log n)^3)$ amortized time algorithm. Combining it with Thorup's bittrick, we get down to the claimed $\tilde{O}((\log n)^2)$ amortized time.

We also offer improvements in space. We describe a general trick which applies to both of our new algorithms, and to the old ones, to get down to linear space, where the previous best use $O(m + n\log n\log\log n)$. Finally, we show how to obtain $O(\log n/\log \log n)$ query time, matching the optimal trade-off between update and query time.

Our result yields an improved running time for deciding whether a unique perfect matching exists in a static graph.

at July 21, 2017 12:00 AM UTC

Authors: Andreas Schmid, Jens M. Schmidt
Download: PDF
Abstract: Tutte paths are one of the most successful tools for attacking Hamiltonicity problems in planar graphs. Unfortunately, results based on them are non-constructive, as their proofs inherently use an induction on overlapping subgraphs and these overlaps prevent bounding the running time by a polynomial. For special cases, however, computational results on Tutte paths are known: For 4-connected planar graphs, Tutte paths are in fact Hamiltonian paths and Chiba and Nishizeki showed how to compute such paths in linear time. For 3-connected planar graphs, Tutte paths have a more complicated structure, and it has only recently been shown that they can be computed in polynomial time. However, Tutte paths are defined for general 2-connected planar graphs and this is what most applications need. Unfortunately, no computational results are known. We give the first efficient algorithm that computes a Tutte path (for the general case of 2-connected planar graphs). One of the strongest existence results about such Tutte paths is due to Sanders, which allows one to prescribe the end vertices and an intermediate edge of the desired path. Encompassing and strengthening all previous computational results on Tutte paths, we show how to compute this special Tutte path efficiently. Our method refines both the results of Thomassen and of Sanders, and avoids overlapping subgraphs by using a novel iterative decomposition along 2-separators. Finally, we show that our algorithm runs in quadratic time.

at July 21, 2017 12:00 AM UTC

Here is a game (Darling says I only blog about non-fun games. This post will NOT prove her wrong.)

Let D be a domain, d ≥ 1, and 0 ≠ a0 ∈ D. There are two players, Wanda (for Wants root) and Nora (for No root). One of the players is Player I, the other Player II.

(1) Player I and II alternate (with Player I going first) choosing the coefficients in D of a polynomial of degree d with the constant term preset to a0.

(2) When they are done, if there is a root in D  then Wanda wins, else Nora wins.
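
To make the rules concrete, here is a warm-up example (mine, not from the paper): take d=1, so the only coefficient to choose is a1 (and a1 ≠ 0, since the polynomial must have degree 1), and Player I chooses it. Over Q, the root -a0/a1 is always rational, so Wanda wins no matter which player she is. Over Z, Player I wins whichever role she has: as Wanda she plays a1 = a0 to force the root -1, and as Nora she plays a1 = 2a0 so that the only root, -1/2, is not an integer.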

There is a paper by Gasarch-Washington-Zbarsky here where we determine who wins the game when D is Z or Q (these proofs are elementary), any finite extension of Q (this proof uses hard number theory), R, C (actually any algebraically closed field), and any finite field.

How did I think of this game?  There was a paper called Greedy Galois Games (which I blogged about here). When I saw the title I thought the game might be that players pick coefficients from Q and if the final polynomial has a solution in radicals then (say) Player I wins. That was not correct. They only use that Galois was a bad duelist. Even so, the paper INSPIRED me! Hence the paper above! The motivating problem is still open:

Open Question: Let d be at least 5. Play the above game except that (1) the coefficients come from Q, and (2) Wanda wins if the final poly is solvable by radicals, otherwise Nora wins. (Note that if d=1,2,3,4 then Wanda wins.) Who wins?

If they had named their game Hamilton Game (since Alexander Hamilton lost a duel) I might have been inspired to come up with a game about quaternions or Hamiltonian cycles.

POINT- take ideas for problems from any source, even an incorrect guess about a paper!




by GASARCH ([email protected]) at July 20, 2017 09:47 PM UTC

Authors: Martin Nägele, Benny Sudakov, Rico Zenklusen
Download: PDF
Abstract: Submodular function minimization (SFM) is a fundamental and efficiently solvable problem class in combinatorial optimization with a multitude of applications in various fields. Surprisingly, there is only very little known about constraint types under which SFM remains efficiently solvable. The arguably most relevant non-trivial constraint class for which polynomial SFM algorithms are known are parity constraints, i.e., optimizing only over sets of odd (or even) cardinality. Parity constraints capture classical combinatorial optimization problems like the odd-cut problem, and they are a key tool in a recent technique to efficiently solve integer programs with a constraint matrix whose subdeterminants are bounded by two in absolute value.

We show that efficient SFM is possible even for a significantly larger class than parity constraints, by introducing a new approach that combines techniques from Combinatorial Optimization, Combinatorics, and Number Theory. In particular, we can show that efficient SFM is possible over all sets (of any given lattice) of cardinality r mod m, as long as m is a constant prime power. This covers generalizations of the odd-cut problem with open complexity status, and with relevance in the context of integer programming with higher subdeterminants. To obtain our results, we establish a connection between the correctness of a natural algorithm, and the inexistence of set systems with specific combinatorial properties. We introduce a general technique to disprove the existence of such set systems, which allows for obtaining extensions of our results beyond the above-mentioned setting. These extensions settle two open questions raised by Geelen and Kapadia [Combinatorica, 2017] in the context of computing the girth and cogirth of certain types of binary matroids.

at July 20, 2017 12:00 AM UTC

Authors: Hiroshi Saito, Hirotada Honda
Download: PDF
Abstract: We geometrically analyze the problem of estimating parameters related to the shape and size of a two-dimensional target object on the plane by using randomly distributed distance sensors whose locations are unknown. Based on the analysis using geometric probability, we discuss the observability of these parameters: which parameters we can estimate and what conditions are required to estimate them. For a convex target object, its size and perimeter length are observable, and other parameters are not observable. For a general polygon target object, convexity in addition to its size and perimeter length is observable. Parameters related to a concave vertex can be observable when some conditions are satisfied. We also propose a method for estimating the convexity of a target object and the perimeter length of the target object.

at July 20, 2017 12:00 AM UTC

Authors: Stefan Felsner, Tamás Mészáros, Piotr Micek
Download: PDF
Abstract: The dimension of a partially ordered set $P$ is an extensively studied parameter. Small dimension allows succinct encoding. Indeed if $P$ has dimension $d$, then to know whether $x < y$ in $P$ it is enough to check whether $x<y$ in each of the $d$ linear extensions of a witnessing realizer. Focusing on the encoding aspect Ne\v{s}et\v{r}il and Pudl\'{a}k defined the boolean dimension so that $P$ has boolean dimension at most $d$ if it is possible to decide whether $x < y$ in $P$ by looking at the relative position of $x$ and $y$ in only $d$ permutations of the elements of $P$.

Our main result is that posets with cover graphs of bounded tree-width have bounded boolean dimension. This stays in contrast with the fact that there are posets with cover graphs of tree-width three and arbitrarily large dimension.

As a corollary, we obtain a labeling scheme of size $\log(n)$ for reachability queries on $n$-vertex digraphs with bounded tree-width. Our techniques seem to be very different from the usual approach for labeling schemes which is based on divide-and-conquer decompositions of the input graphs.

at July 20, 2017 12:00 AM UTC

Authors: Anton V. Eremeev, Alexander V. Kelmanov, Artem V. Pyatkin, Igor A. Ziegler
Download: PDF
Abstract: In this paper, we consider the problem of finding a maximum cardinality subset of vectors, given a constraint on the normalized squared length of vectors sum. This problem is closely related to Problem 1 from (Eremeev, Kel'manov, Pyatkin, 2016). The main difference consists in swapping the constraint with the optimization criterion.

We prove that the problem is NP-hard even in terms of finding a feasible solution. An exact algorithm for solving this problem is proposed. The algorithm has a pseudo-polynomial time complexity in the special case of the problem, where the dimension of the space is bounded from above by a constant and the input data are integer. A computational experiment is carried out, where the proposed algorithm is compared to COINBONMIN solver, applied to a mixed integer quadratic programming formulation of the problem. The results of the experiment indicate superiority of the proposed algorithm when the dimension of Euclidean space is low, while the COINBONMIN has an advantage for larger dimensions.

at July 20, 2017 12:00 AM UTC

Authors: Aaron Bernstein, Jacob Holm, Eva Rotenberg
Download: PDF
Abstract: In the online bipartite matching problem with replacements, all the vertices on one side of the bipartition are given, and the vertices on the other side arrive one by one with all their incident edges. The goal is to maintain a maximum matching while minimizing the number of changes (replacements) to the matching. We show that the greedy algorithm that always takes the shortest augmenting path from the newly inserted vertex (denoted the SAP protocol) uses at most amortized $O(\log^2 n)$ replacements per insertion, where $n$ is the total number of vertices inserted. This is the first analysis to achieve a polylogarithmic number of replacements for \emph{any} replacement strategy, almost matching the $\Omega(\log n)$ lower bound. The previous best known strategy achieved amortized $O(\sqrt{n})$ replacements [Bosek, Leniowski, Sankowski, Zych, FOCS 2014]. For the SAP protocol in particular, nothing better than the trivial $O(n)$ bound was known except in special cases.

Our analysis immediately implies the same upper bound of $O(\log^2 n)$ reassignments for the capacitated assignment problem, where each vertex on the static side of the bipartition is initialized with the capacity to serve a number of vertices.

We also analyze the problem of minimizing the maximum server load. We show that if the final graph has maximum server load $L$, then the SAP protocol makes amortized $O( \min\{L \log^2 n , \sqrt{n}\log n\})$ reassignments. We also show that this is close to tight because $\Omega(\min\{L, \sqrt{n}\})$ reassignments can be necessary.

at July 20, 2017 12:00 AM UTC

Authors: Paweł Gawrychowski, Jakub Łopuszański
Download: PDF
Abstract: A labeling scheme for nearest common ancestors assigns a distinct binary string, called the label, to every node of a tree, so that given the labels of two nodes (and no further information about the topology of the tree) we can compute the label of their nearest common ancestor. The goal is to make the labels as short as possible. Alstrup, Gavoille, Kaplan, and Rauhe [Theor. Comput. Syst. 37(3):441-456 2004] showed that $O(\log n)$-bit labels are enough. More recently, Alstrup, Halvorsen, and Larsen [SODA 2014] refined this to only $2.772\log n$, and provided a lower bound of $1.008\log n$.

We connect designing a labeling scheme for nearest common ancestors to the existence of a tree, called a minor-universal tree, that contains every tree on $n$ nodes as a topological minor. Even though it is not clear if a labeling scheme must be based on such a notion, we argue that the existing schemes can be reformulated as such, and it allows us to obtain clean and good bounds on the length of the labels. As the main upper bound, we show that $2.318\log n$-bit labels are enough. Surprisingly, the notion of a minor-universal tree for binary trees on $n$ nodes has been already used in a different context by Hrubes et al. [CCC 2010], and Young, Chu, and Wong [J. ACM 46(3):416-435, 1999] introduced a closely related notion of a universal tree. On the lower bound side, we show that the size of any minor-universal tree for trees on $n$ nodes is $\Omega(n^{2.174})$. This highlights a natural limitation for all approaches based on such trees. Our lower bound technique also implies that the size of any universal tree in the sense of Young et al. is $\Omega(n^{2.185})$, thus dramatically improving their lower bound of $\Omega(n\log n)$. We complement the existential results with a generic transformation that decreases the query time to constant in any scheme based on a minor-universal tree.

at July 20, 2017 12:00 AM UTC

Authors: Pranav Kant Gaur, S. K. Bose
Download: PDF
Abstract: In this article, recent works on 2D Constrained Delaunay Triangulation (CDT) algorithms are reported. Since the review of CDT algorithms presented by de Floriani (Issues on Machine Vision, Springer Vienna, pp. 95--104, 1989), different algorithms for the construction and applications of CDT have appeared in the literature, each concerned with different aspects of computation and different suitabilities. Therefore, the objective of this work is to provide an update over that review article, considering contemporary prominent algorithms and generalizations for the problem of two-dimensional Constrained Delaunay triangulation.

at July 20, 2017 12:00 AM UTC

Authors: Martin Grohe, Nicole Schweikardt
Download: PDF
Abstract: We study an extension of first-order logic that allows one to express cardinality conditions in a way similar to SQL's COUNT operator. The corresponding logic FOC(P) was introduced by Kuske and Schweikardt (LICS'17), who showed that query evaluation for this logic is fixed-parameter tractable on classes of structures (or databases) of bounded degree. In the present paper, we first show that the fixed-parameter tractability of FOC(P) cannot even be generalised to very simple classes of structures of unbounded degree such as unranked trees or strings with a linear order relation.

Then we identify a fragment FOC1(P) of FOC(P) which is still sufficiently strong to express standard applications of SQL's COUNT operator. Our main result shows that query evaluation for FOC1(P) is fixed-parameter tractable with almost linear running time on nowhere dense classes of structures. As a corollary, we also obtain a fixed-parameter tractable algorithm for counting the number of tuples satisfying a query over nowhere dense classes of structures.

at July 20, 2017 12:00 AM UTC

Authors: Jun Zhang, Qi Cheng
Download: PDF
Abstract: In their celebrated paper "On Siegel's Lemma", Bombieri and Vaaler found an upper bound on the height of integer solutions of systems of linear Diophantine equations. Calculating the bound directly, however, requires exponential time. In this paper, we present the bound in a different form that can be computed in polynomial time. We also give an elementary (and arguably simpler) proof for the bound.

at July 20, 2017 02:01 AM UTC

Authors: Michael Mitzenmacher, Tom Morgan
Download: PDF
Abstract: We explore a generalization of set reconciliation, where the goal is to reconcile sets of sets. Alice and Bob each have a parent set consisting of $s$ child sets, each containing at most $h$ elements from a universe of size $u$. They want to reconcile their sets of sets in a scenario where the total number of differences between all of their child sets (under the minimum difference matching between their child sets) is $d$. We give several algorithms for this problem, and discuss applications to reconciliation problems on graphs, databases, and collections of documents. We specifically focus on graph reconciliation, providing protocols based on sets of sets reconciliation for random graphs from $G(n,p)$ and for forests of rooted trees.

at July 20, 2017 12:00 AM UTC

Authors: Vassil Dimitrov, Diego Coelho
Download: PDF
Abstract: This paper proposes new factorizations for computing the Neumann series. The factorizations are based on fast algorithms for series of small prime sizes and on splitting large sizes into several smaller ones. We propose bases for the factorizations other than the well-known binary and ternary ones. We show that it is possible to reduce the overall complexity of the usual binary decomposition from $2\log_2(N)-2$ multiplications to around $1.72\log_2(N)-2$ using a basis of size five. By merging different bases we demonstrate that fast algorithms can be built for particular sizes. We also show the asymptotic case, where one can reduce the number of multiplications to around $1.70\log_2(N)-2$. Simulations are performed for applications in the context of wireless communications and image rendering, where it is necessary to perform large-scale matrix inversion.

at July 20, 2017 12:00 AM UTC

Authors: Dhiraj Holden
Download: PDF
Abstract: We show the first unconditional pseudo-determinism result for all of search-BPP. Specifically, we show that every BPP search problem can be computed pseudo-deterministically on average for infinitely many input lengths. In other words, for infinitely many input lengths and for any polynomial-time samplable distribution our algorithm succeeds in producing a unique answer (if one exists) with high probability over the distribution and the coins tossed.

at July 20, 2017 12:00 AM UTC

I recently added to the incidence theory book two chapters about incidences in (Chapters 8 and 9). At the moment incidence bounds in are known for several different cases, and it is rather unclear how a general incidence bound in should look like. In this post I would like to state what I believe this […]

by Adam Sheffer at July 19, 2017 08:19 PM UTC

A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade, that provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!

Perturbing Gradient Descent

We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:

$\quad\quad\quad\quad x_{t+1} = x_{t} - \eta \nabla f(x_{t}),$

where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theoretically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.

Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:

  1. Intermittent Perturbations: Ge, Huang, Jin and Yuan 2015 considered adding occasional random perturbations to GD, and were able to provide the first polynomial time guarantee for GD to escape saddle points. (See also Rong Ge’s post )

  2. Random Initialization: Lee et al. 2016 showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also Ben Recht’s post)

Asymptotic — and even polynomial-time — results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.

One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.

Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?

A negative answer to the first question emerges if one considers the random initialization strategy discussed above. Indeed, this approach is provably inefficient in general, taking exponential time to escape saddle points in the worst case (see the “On the Necessity of Adding Perturbations” section below).

Somewhat surprisingly, it turns out that we obtain a rather different — and positive — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:

Perturbed gradient descent (PGD)

  1. for $~t = 1, 2, \ldots ~$ do
  2. $\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$
  3. $\quad\quad$ if $~$perturbation condition holds$~$ then
  4. $\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$

Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.
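To make the procedure concrete, here is a hedged Python sketch of PGD in the spirit of the pseudocode above. The exact perturbation condition and constants analyzed in the paper are more delicate; the version below (perturb when the gradient is small and no perturbation was added within the last t_thresh steps) is a simplification of our own.

    import numpy as np

    def pgd(grad, x0, eta, num_steps, g_thresh, radius, t_thresh, seed=0):
        """Perturbed gradient descent: plain GD steps plus an occasional
        perturbation drawn uniformly from a small ball when the gradient is small."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        d = x.size
        t_last = -np.inf  # step index of the most recent perturbation
        for t in range(num_steps):
            x = x - eta * grad(x)  # the usual gradient step
            if np.linalg.norm(grad(x)) <= g_thresh and t - t_last > t_thresh:
                # sample xi uniformly from the ball of radius `radius` centered at zero
                xi = rng.standard_normal(d)
                xi *= radius * rng.uniform() ** (1.0 / d) / np.linalg.norm(xi)
                x = x + xi
                t_last = t
        return x

Sampling a Gaussian direction and rescaling it by a $U^{1/d}$ factor is a standard way to draw a uniform point from a ball; as noted above, neither the uniform distribution nor the small-gradient trigger appears to be essential.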

Strict-Saddle and Second-order Stationary Points

We define saddle points in this post to include both classical saddle points and local maxima: they are stationary points that are locally maximized along at least one direction. Saddle points and local minima can be categorized according to the minimum eigenvalue of the Hessian:

$\quad\quad\quad\quad \lambda_{\min}(\nabla^2 f(x)) > 0$: local minimum;
$\quad\quad\quad\quad \lambda_{\min}(\nabla^2 f(x)) = 0$: local minimum or saddle point;
$\quad\quad\quad\quad \lambda_{\min}(\nabla^2 f(x)) < 0$: saddle point.

We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, strict saddle points.

[Figure: a strict saddle point vs. a non-strict saddle point]

While a non-strict saddle point can be flat along its directions of zero curvature, a strict saddle point has at least one direction along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is NP-hard; therefore, we — and previous authors — focus on escaping strict saddle points.

Formally, we make the following two standard assumptions regarding smoothness.

Assumption 1: $f$ is $\ell$-gradient-Lipschitz, i.e.
$\quad\quad\quad\quad \forall x_1, x_2, |\nabla f(x_1) - \nabla f(x_2)| \le \ell |x_1 - x_2|$.
$~$
Assumption 2: $f$ is $\rho$-Hessian-Lipschitz, i.e.
$\quad\quad\quad\quad \forall x_1, x_2$, $|\nabla^2 f(x_1) - \nabla^2 f(x_2)| \le \rho |x_1 - x_2|$.

Similarly to classical theory, which studies convergence to a first-order stationary point, $\nabla f(x) = 0$, by bounding the number of iterations to find an $\epsilon$-first-order stationary point, $|\nabla f(x)| \le \epsilon$, we formulate the speed of escape of strict saddle points and the ensuing convergence to a second-order stationary point, $\nabla f(x) = 0, \lambda_{\min}(\nabla^2 f(x)) \ge 0$, with an $\epsilon$-version of the definition:

Definition: A point $x$ is an $\epsilon$-second-order stationary point if:
$\quad\quad\quad\quad |\nabla f(x)|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.

In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of Nesterov and Polyak 2006.
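As a sanity check on this definition, both conditions can be verified numerically. The sketch below is our own illustration (with the gradient and Hessian supplied by hand); it tests the origin of $f(x) = x_1^2 - x_2^2$, which is a strict saddle and therefore fails the check.

    import numpy as np

    def is_eps_second_order_stationary(grad, hess, x, eps, rho):
        """Check |grad f(x)| <= eps and lambda_min(hess f(x)) >= -sqrt(rho * eps)."""
        small_gradient = np.linalg.norm(grad(x)) <= eps
        lam_min = np.linalg.eigvalsh(hess(x)).min()
        return small_gradient and lam_min >= -np.sqrt(rho * eps)

    # f(x) = x_1^2 - x_2^2: the gradient vanishes at the origin, but the Hessian
    # has eigenvalue -2 there, so the origin is not an eps-second-order stationary point.
    grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
    hess = lambda x: np.diag([2.0, -2.0])
    print(is_eps_second_order_stationary(grad, hess, np.zeros(2), eps=1e-3, rho=1.0))  # False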

Applications

In a wide range of practical nonconvex problems it has been proved that all saddle points are strict — such problems include, but are not limited to, principal components analysis, canonical correlation analysis, orthogonal tensor decomposition, phase retrieval, dictionary learning, matrix sensing, matrix completion, and other nonconvex low-rank problems.

Furthermore, in all of these nonconvex problems, it also turns out that all local minima are global minima. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problems with global guarantees.

Escaping Saddle Points with Negligible Overhead

In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:

Theorem (Nesterov 1998): If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-first-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.

In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says that for any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.

We now wish to address the corresponding problem for second-order stationary points. What is the best we can hope for? Can we also achieve

  1. A dimension-free number of iterations;
  2. An $O(1/\epsilon^2)$ convergence rate;
  3. The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?

Rather surprisingly, the answer is Yes to all three questions (up to small log factors).

Main Theorem: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-second-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.

Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point. In this sense, we claim that PGD can escape strict saddle points almost for free.

We turn to a discussion of some of the intuitions underlying these results.

Why do polylog(d) iterations suffice?

Our strict-saddle assumption means that, in the worst case, there is only one direction among the $d$ dimensions along which we can escape. A naive search for this descent direction should intuitively take at least $\text{poly}(d)$ iterations, so why should only $\text{polylog}(d)$ iterations suffice?

Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = \frac{1}{2} x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).

It is straightforward to work out the general form of the iterates in this case:

$\quad\quad\quad\quad x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)\, x_{t-1} = (I - \eta H)^t x_0$.

Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one. The decrease in the function value can be expressed as:

$\quad\quad\quad\quad f(x_t) - f(x_0) = \frac{1}{2}\, x_0^\top \left[ (I - \eta H)^t H (I - \eta H)^t - H \right] x_0$.

Set the step size to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component in the $i$-th direction of the initial point $x_0$. We have $\sum_{i=1}^d \alpha_i^2 = | x_0|^2 = 1$, thus:

$\quad\quad\quad\quad f(x_t) - f(x_0) = \frac{1}{2} \sum_{i=1}^d \lambda_i \left[ \left(1 - \tfrac{\lambda_i}{2}\right)^{2t} - 1 \right] \alpha_i^2 \le -\frac{1}{2} \left[ \left(\tfrac{3}{2}\right)^{2t} - 1 \right] \alpha_1^2$.

A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will result in at least an $\Omega(1/d)$ component in the first direction with high probability; that is, $\alpha^2_1 = \Omega(1/d)$. Substituting this bound on $\alpha_1$ into the inequality above, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.
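To see this scaling empirically, here is a short simulation of the quadratic example just described (our own sketch, not code from the paper): starting from a uniform perturbation in the unit ball, it counts the GD steps with step size $1/2$ until the function value has dropped by a constant, and the count grows roughly like $\log d$.

    import numpy as np

    def steps_to_escape(d, rng, decrease=1.0, eta=0.5, max_steps=10_000):
        """GD on f(x) = 0.5 * x^T H x with H = diag(-1, 1, ..., 1), started from a
        uniform perturbation in the unit ball; returns the number of steps until
        the function value has dropped by `decrease`."""
        diag_H = np.concatenate(([-1.0], np.ones(d - 1)))
        f = lambda z: 0.5 * z @ (diag_H * z)
        x = rng.standard_normal(d)
        x *= rng.uniform() ** (1.0 / d) / np.linalg.norm(x)  # uniform in the unit ball
        f0 = f(x)
        for t in range(1, max_steps + 1):
            x = x - eta * (diag_H * x)  # the gradient of f is H x
            if f(x) <= f0 - decrease:
                return t
        return max_steps

    rng = np.random.default_rng(0)
    for d in (10, 100, 1000, 10_000):
        print(d, np.mean([steps_to_escape(d, rng) for _ in range(20)]))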

Pancake-shaped stuck region for general Hessian

We can conclude that, for the case of a constant Hessian, it is only when the perturbation $x_0$ lands in the set $\{x : |e_1^\top x|^2 \le O(1/d)\} \cap \mathcal{B}_0(1)$ that GD can take a very long time to escape the saddle point. We call this set the stuck region; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as a green object in the left graph. In general this region will not have an analytic expression.

Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation: although we don’t know the shape of the stuck region, we know it is very thin.

[Figure: the pancake-shaped stuck region around a saddle point]

In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.
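The following small sketch (again our own, on the constant-Hessian example from before) illustrates the coupling idea: two starting points that differ by $1/\sqrt{d}$ along the escape direction are run in parallel, their separation along that direction grows geometrically under GD, and therefore at least one of the two trajectories must move a constant distance away from the saddle within $O(\log d)$ steps.

    import numpy as np

    def coupled_escape_time(d, eta=0.5, num_steps=100):
        """Run GD on f(x) = 0.5 * x^T H x, H = diag(-1, 1, ..., 1), from two points
        w and u that differ only by 1/sqrt(d) along the escape direction e_1, and
        report when their separation along e_1 first reaches a constant."""
        diag_H = np.concatenate(([-1.0], np.ones(d - 1)))
        w = np.zeros(d)            # w sits exactly on the stuck set (w_1 = 0)
        u = w.copy()
        u[0] = 1.0 / np.sqrt(d)    # u is shifted along the escape direction
        for t in range(1, num_steps + 1):
            w = w - eta * (diag_H * w)
            u = u - eta * (diag_H * u)
            if abs(u[0] - w[0]) >= 1.0:  # separation along e_1 is now Omega(1)
                return t
        return num_steps

    for d in (10, 100, 1000, 10_000):
        print(d, coupled_escape_time(d))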

On the Necessity of Adding Perturbations

We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh, we have shown that even with fairly natural random initialization schemes and non-pathological functions, GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different — it can generically escape saddle points in polynomial time.

To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).

[Figure: the constructed objective for $d = 2$ (left) and an escape-time experiment with $d = 10$ (right)]

When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.

Conclusion

In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.

There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What types of local minima are tractable, and are there useful structural assumptions that we can impose so that undesirable local minima can be avoided efficiently? We are making slow but steady progress on nonconvex optimization, and there is hope that at some point we will transition from “black art” to “science”.

at July 19, 2017 10:00 AM UTC

The UCSD machine learning and theory groups are looking for strong post-doctoral applicants. Interest in broadly-defined data science is preferred, but all excellent candidates will be considered.

by Moses Charikar at July 19, 2017 06:05 AM UTC

Nikhil Devanur, one of the PC Chairs for WINE 2017, presents the following announcement. Please consider sending a paper to WINE this year!


Reminder: The WINE 2017 deadline is in roughly 2 weeks, on August 2. The conference will be held during December 17-20, 2017, at the Indian Institute of Science, Bangalore, India.

 

The CFP is here: http://lcm.csa.iisc.ernet.in/wine2017/papers.html

Submission server is here: https://easychair.org/conferences/?conf=wine2017

 

Some points to raise awareness:

  1. A selection of papers accepted to the conference is invited to GEB/TEAC.
  2. WINE has had the best paper award for a couple of years now. In some years the committee has chosen not to give the award to any paper, but the option has been there.
  3. WINE is seen as being narrower than EC, and some subset of the EC community doesn’t typically submit to WINE. We would like to broaden the scope of WINE to include all those topics covered in EC. In particular, we welcome and encourage papers from the AI, OR, and Econ communities.
  4. We also welcome experimental papers, and in particular papers that have both a theoretical and an experimental part are strongly encouraged.

We are trying different things to make WINE a top-tier conference, but it all starts with you submitting good papers.

 


by robertkleinberg at July 19, 2017 04:18 AM UTC

Authors: Dimitris Fotakis, Vasilis Kontonis, Piotr Krysta, Paul Spirakis
Download: PDF
Abstract: We introduce the problem of simultaneously learning all powers of a Poisson Binomial Distribution (PBD). A PBD of order $n$ is the distribution of a sum of $n$ mutually independent Bernoulli random variables $X_i$, where $\mathbb{E}[X_i] = p_i$. The $k$-th power of this distribution, for $k$ in a range $[m]$, is the distribution of $P_k = \sum_{i=1}^n X_i^{(k)}$, where each Bernoulli random variable $X_i^{(k)}$ has $\mathbb{E}[X_i^{(k)}] = (p_i)^k$. The learning algorithm can query any power $P_k$ several times and succeeds in learning all powers in the range if, with probability at least $1- \delta$, given any $k \in [m]$, it returns a probability distribution $Q_k$ with total variation distance from $P_k$ at most $\epsilon$. We provide almost matching lower and upper bounds on the query complexity for this problem. We first show a lower bound on the query complexity for instances of PBD powers with many distinct, well-separated parameters $p_i$, and we almost match this lower bound by examining the query complexity of simultaneously learning all the powers of a special class of PBDs resembling the PBDs of our lower bound. We study the fundamental setting of a Binomial distribution, and provide an optimal algorithm which uses $O(1/\epsilon^2)$ samples. Diakonikolas, Kane and Stewart [COLT'16] showed a lower bound of $\Omega(2^{1/\epsilon})$ samples to learn the $p_i$'s within error $\epsilon$. The question of whether sampling from powers of PBDs can reduce this sample complexity has a negative answer: we show that an exponential number of samples is inevitable. Having sampling access to the powers of a PBD, we then give a nearly optimal algorithm that learns its $p_i$'s. To prove our two last lower bounds we extend the classical minimax risk definition from statistics to estimating functions of sequences of distributions.

at July 19, 2017 12:57 AM UTC

Authors: John Augustine, William K. Moses Jr
Download: PDF
Abstract: We introduce a new problem in the domain of mobile robots, which we term dispersion. In this problem, $n$ robots are placed in an $n$ node graph arbitrarily and must coordinate with each other to reach a final configuration such that exactly one robot is at each node. We study this problem through the lenses of minimizing the memory required by each robot and of minimizing the number of rounds required to achieve dispersion.

Dispersion is of interest due to its relationship to the problem of exploration using mobile robots and the problem of load balancing on a graph. Additionally, dispersion has an immediate real-world application due to its relationship to the problem of recharging electric cars: each car can be considered a robot, and the recharging stations and the roads connecting them can be considered the nodes and edges of a graph, respectively. Since recharging is a costly affair relative to traveling, we want to distribute these cars amongst the various available recharge points, with communication limited to car-to-car interactions.

We provide lower bounds on both the memory required for robots to achieve dispersion and the minimum running time to achieve dispersion on any type of graph. We then analyze the trade-offs between time and memory for various types of graphs. We provide time optimal and memory optimal algorithms for several types of graphs and show the power of a little memory in terms of running time.

at July 19, 2017 12:52 AM UTC

Authors: Rodrigo Canovas, Bastien Cazaux, Eric Rivals
Download: PDF
Abstract: For analysing text algorithms, for computing superstrings, or for testing random number generators, one needs to compute all overlaps between any pairs of words in a given set. The positions of overlaps of a word onto itself, or of two words, are needed to compute the absence probability of a word in a random text, or the numbers of common words shared by two random texts. In all these contexts, one needs to compute or to query overlaps between pairs of words in a given set. To this end, we designed COvI, a compressed overlap index that supports multiple queries on overlaps, such as computing the correlation of two words, or listing pairs of words whose longest overlap is maximal among all possible pairs. COvI stores overlaps in a hierarchical and non-redundant manner. We propose an implementation that can handle datasets of millions of words and still answer queries efficiently. Comparison with a baseline solution (called FullAC) relying on the Aho-Corasick automaton shows that COvI provides significant advantages. For similar construction times, COvI requires half the memory of FullAC, and still solves complex queries much faster.

at July 19, 2017 12:50 AM UTC

Authors: Nikhil Bansal, Martin Böhm, Marek Eliáš, Grigorios Koumoutsos, Seeun William Umboh
Download: PDF
Abstract: In the Convex Body Chasing problem, we are given an initial point $v_0$ in $R^d$ and an online sequence of $n$ convex bodies $F_1, ..., F_n$. When we receive $F_i$, we are required to move inside $F_i$. Our goal is to minimize the total distance travelled. This fundamental online problem was first studied by Friedman and Linial (DCG 1993). They proved an $\Omega(\sqrt{d})$ lower bound on the competitive ratio, and conjectured that a competitive ratio depending only on d is possible. However, despite much interest in the problem, the conjecture remains wide open.

We consider the setting in which the convex bodies are nested: $F_1 \supset ... \supset F_n$. The nested setting is closely related to extending the online LP framework of Buchbinder and Naor (ESA 2005) to arbitrary linear constraints. Moreover, this setting retains much of the difficulty of the general setting and captures an essential obstacle in resolving Friedman and Linial's conjecture. In this work, we give the first $f(d)$-competitive algorithm for chasing nested convex bodies in $R^d$.

at July 19, 2017 12:50 AM UTC

Authors: Maryam Aliakbarpour, Ilias Diakonikolas, Ronitt Rubinfeld
Download: PDF
Abstract: We investigate the problems of identity and closeness testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing Differential Privacy to the individuals of the population. We describe an approach that yields sample-efficient differentially private testers for these problems. Our theoretical results show that there exist private identity and closeness testers that are nearly as sample-efficient as their non-private counterparts. We perform an experimental evaluation of our algorithms on synthetic data. Our experiments illustrate that our private testers achieve small type I and type II errors with sample size sublinear in the domain size of the underlying distributions.

at July 19, 2017 12:46 AM UTC

Authors: Andrzej Grzesik, Tereza Klimošová, Marcin Pilipczuk, Michał Pilipczuk
Download: PDF
Abstract: In the classic Maximum Weight Independent Set problem we are given a graph $G$ with a nonnegative weight function on vertices, and the goal is to find an independent set in $G$ of maximum possible weight. While the problem is NP-hard in general, we give a polynomial-time algorithm working on any $P_6$-free graph, that is, a graph that has no path on $6$ vertices as an induced subgraph. This improves the polynomial-time algorithm on $P_5$-free graphs of Lokshtanov et al. (SODA 2014), and the quasipolynomial-time algorithm on $P_6$-free graphs of Lokshtanov et al (SODA 2016). The main technical contribution leading to our main result is enumeration of a polynomial-size family $\mathcal{F}$ of vertex subsets with the following property: for every maximal independent set $I$ in the graph, $\mathcal{F}$ contains all maximal cliques of some minimal chordal completion of $G$ that does not add any edge incident to a vertex of $I$.

at July 19, 2017 12:48 AM UTC

Authors: Qianggong Zhang, Tat-Jun Chin
Download: PDF
Abstract: Multiple-view triangulation by $\ell_{\infty}$ minimisation has become established in computer vision. State-of-the-art $\ell_{\infty}$ triangulation algorithms exploit the quasiconvexity of the cost function to derive iterative update rules that deliver the global minimum. Such algorithms, however, can be computationally costly for large problem instances that contain many image measurements, e.g., from web-based photo sharing sites or long-term video recordings. In this paper, we prove that $\ell_{\infty}$ triangulation admits a coreset approximation scheme, which seeks small representative subsets of the input data called coresets. A coreset possesses the special property that the error of the $\ell_{\infty}$ solution on the coreset is within known bounds from the global minimum. We establish the necessary mathematical underpinnings of the coreset algorithm, specifically, by enacting the stopping criterion of the algorithm and proving that the resulting coreset gives the desired approximation accuracy. On large-scale triangulation problems, our method provides theoretically sound approximate solutions. Iterated until convergence, our coreset algorithm is also guaranteed to reach the true optimum. On practical datasets, we show that our technique can in fact attain the global minimiser much faster than current methods.

at July 19, 2017 12:00 AM UTC

Authors: Adil Erzin
Download: PDF
Abstract: In this paper, we propose several regular covers with identical sectors that observe the strip in the same direction, and we carry out their analysis, which allows choosing the best coverage model and the best parameters of the sector.

at July 19, 2017 12:00 AM UTC

Authors: Oswin Aichholzer, Thomas Hackl, Matias Korman, Marc van Kreveld, Maarten Löffler, Alexander Pilz, Bettina Speckmann, Emo Welzl
Download: PDF
Abstract: We consider the following question: How many edge-disjoint plane spanning trees are contained in a complete geometric graph $GK_n$ on any set $S$ of $n$ points in general position in the plane? We show that this number is in $\Omega(\sqrt{n})$. Further, we consider variants of this problem by bounding the diameter and the degree of the trees (in particular considering spanning paths).

at July 19, 2017 01:02 AM UTC

Authors: Sushmita Gupta, Saket Saurabh, Meirav Zehavi
Download: PDF
Abstract: Stable Marriage is a fundamental problem to both computer science and economics. Four well-known NP-hard optimization versions of this problem are the Sex-Equal Stable Marriage (SESM), Balanced Stable Marriage (BSM), max-Stable Marriage with Ties (max-SMT) and min-Stable Marriage with Ties (min-SMT) problems. In this paper, we analyze these problems from the viewpoint of Parameterized Complexity. We conduct the first study of these problems with respect to the parameter treewidth. First, we study the treewidth $\mathtt{tw}$ of the primal graph. We establish that all four problems are W[1]-hard. In particular, while it is easy to show that all four problems admit algorithms that run in time $n^{O(\mathtt{tw})}$, we prove that all of these algorithms are likely to be essentially optimal. Next, we study the treewidth $\mathtt{tw}$ of the rotation digraph. In this context, the max-SMT and min-SMT are not defined. For both SESM and BSM, we design (non-trivial) algorithms that run in time $2^{\mathtt{tw}}n^{O(1)}$. Then, for both SESM and BSM, we also prove that unless SETH is false, algorithms that run in time $(2-\epsilon)^{\mathtt{tw}}n^{O(1)}$ do not exist for any fixed $\epsilon>0$. We thus present a comprehensive, complete picture of the behavior of central optimization versions of Stable Marriage with respect to treewidth.

at July 19, 2017 12:00 AM UTC

Authors: Arash Haddadan, Alantha Newman, R. Ravi
Download: PDF
Abstract: We explore and introduce tools to design algorithms with improved approximation guarantees for edge connectivity problems such as the traveling salesman problem (TSP) and the 2-edge-connected spanning multigraph problem (2EC) on (restricted classes of) weighted graphs. In particular, we investigate decompositions of graphs that cover small-cardinality cuts an even number of times. Such decompositions allow us to design approximation algorithms that consist of augmenting an initial subgraph (e.g. cycle cover or spanning tree) with a cheap addition to form a feasible solution. Boyd, Iwata and Takazawa recently presented an algorithm to find a cycle cover that covers all 3- and 4-edge cuts in a bridgeless, cubic graph with applications to 2EC on 3-edge-connected, cubic graphs [BIT 2013]. We present several applications of their algorithm to connectivity problems on node-weighted graphs, which has been suggested as an intermediate step between graph and general metrics. Specifically, on 3-edge-connected, cubic, node-weighted graphs, we present a $\frac{7}{5}$-approximation algorithm for TSP and a $\frac{13}{10}$-approximation algorithm for 2EC. Moreover, we use a cycle cover that covers all 3- and 4-edge cuts to show that the everywhere $\frac{18}{19}$ vector is a convex combination of incidence vectors of tours, answering a question of Sebő.

Extending this approach to input graphs that are 2-edge connected necessitates finding methods for covering 2-edge cuts. We therefore present a procedure to decompose an optimal solution for the subtour linear program into spanning, connected subgraphs that cover each 2-edge cut an even number of times. We use this decomposition to design a simple $\frac{4}{3}$-approximation algorithm for the 2-edge connected problem on subcubic, node-weighted graphs.

at July 19, 2017 12:52 AM UTC

Authors: Shay Moran, Amir Yehudayoff
Download: PDF
Abstract: We show that the Radon number characterizes the existence of weak nets in separable convexity spaces (an abstraction of the Euclidean notion of convexity). The construction of weak nets when the Radon number is finite is based on Helly's property and on metric properties of VC classes. The lower bound on the size of weak nets when the Radon number is large relies on the chromatic number of the Kneser graph. As an application, we prove a boosting-type result for weak $\epsilon$-nets.

at July 19, 2017 12:00 AM UTC